﻿1
00:00:00,943 --> 00:00:01,790
- Good morning, everyone.

2
00:00:01,790 --> 00:00:05,730
And welcome to Alumni Day
on our 43rd Darwin Festival.

3
00:00:05,730 --> 00:00:08,060
We have a talk beginning momentarily,

4
00:00:08,060 --> 00:00:11,130
and then we have another
one this afternoon.

5
00:00:11,130 --> 00:00:14,700
Before I introduce one of our
alumni, Mr. Peter Shearstone,

6
00:00:14,700 --> 00:00:17,350
I just wanted to remind
those in the audience

7
00:00:17,350 --> 00:00:20,680
who happen to be alumni
or old friends of biology

8
00:00:20,680 --> 00:00:25,680
to consider joining us for a
short, separate Zoom at 12:30.

9
00:00:25,880 --> 00:00:29,200
If you just email me, I can
send you that Zoom link.

10
00:00:29,200 --> 00:00:32,410
In addition, any students who
happen to be in Meier Hall

11
00:00:32,410 --> 00:00:34,740
after the second talk this afternoon,

12
00:00:34,740 --> 00:00:39,010
if you wanted to find your
way down to Meier Hall 414,

13
00:00:39,010 --> 00:00:42,323
one of the labs, I have
some surprises for you.

14
00:00:43,350 --> 00:00:44,280
Without further ado,

15
00:00:44,280 --> 00:00:47,180
I'd like to introduce one of our alumni,

16
00:00:47,180 --> 00:00:49,230
Mr. Peter Shearstone.

17
00:00:49,230 --> 00:00:52,610
Peter joined Thermo
Fisher Scientific in 2018

18
00:00:52,610 --> 00:00:56,730
as Vice President for Global
Quality and Regulatory Affairs,

19
00:00:56,730 --> 00:00:58,050
and is responsible for leading

20
00:00:58,050 --> 00:01:01,000
the company's global QARA team,

21
00:01:01,000 --> 00:01:04,630
which consists of over 8,000 employees.

22
00:01:04,630 --> 00:01:07,250
The company is headquartered
in Waltham, Mass,

23
00:01:07,250 --> 00:01:09,990
and is the world leader
in serving science,

24
00:01:09,990 --> 00:01:13,350
their mission being to
enable their customers

25
00:01:13,350 --> 00:01:17,400
to make the world a healthier,
cleaner, and safer place.

26
00:01:17,400 --> 00:01:20,550
Peter holds a BS in
Biology from Salem State,

27
00:01:20,550 --> 00:01:24,900
our very own department,
having graduated in 1989.

28
00:01:24,900 --> 00:01:27,340
And his company Thermo Fisher Scientific

29
00:01:27,340 --> 00:01:31,230
is honored to support
the 2022 Darwin Festival,

30
00:01:31,230 --> 00:01:33,630
and for the next four years as well,

31
00:01:33,630 --> 00:01:37,160
bringing the importance
of science to Salem State

32
00:01:37,160 --> 00:01:39,750
as well as the wider community

33
00:01:39,750 --> 00:01:41,980
on the North Shore of Massachusetts.

34
00:01:41,980 --> 00:01:45,280
Peter will introduce our
guest speaker this morning,

35
00:01:45,280 --> 00:01:49,040
Dr. Tara Pelletier, who is also an alumna

36
00:01:49,040 --> 00:01:50,580
of the biology department.

37
00:01:50,580 --> 00:01:52,113
So, over to you, Peter.

38
00:01:53,530 --> 00:01:54,540
- Thanks, Dr. Fisher.

39
00:01:54,540 --> 00:01:56,883
Good to be here, honored to.

40
00:01:58,200 --> 00:02:01,510
I've been part of Darwin
Festivals years ago as a student,

41
00:02:01,510 --> 00:02:02,980
and of course today

42
00:02:02,980 --> 00:02:07,160
as your host for the Alumni Day.

43
00:02:07,160 --> 00:02:10,700
I'm very honored to
introduce Dr. Tara Pelletier,

44
00:02:10,700 --> 00:02:13,040
who is an assistant professor of biology

45
00:02:13,040 --> 00:02:15,860
at Radford University
in Radford, Virginia.

46
00:02:15,860 --> 00:02:19,120
Radford is a public
university founded in 1910,

47
00:02:19,120 --> 00:02:22,370
and is one of eight
doctorate-granting universities

48
00:02:22,370 --> 00:02:24,940
in the Virginia educational system.

49
00:02:24,940 --> 00:02:28,853
She is a 2002 graduate of
Salem State University.

50
00:02:30,110 --> 00:02:31,260
Tara is broadly interested

51
00:02:31,260 --> 00:02:33,740
in the eco-evolutionary processes

52
00:02:33,740 --> 00:02:36,650
that have shaped current
biodiversity patterns.

53
00:02:36,650 --> 00:02:40,550
Her lab works to integrate
various types of data,

54
00:02:40,550 --> 00:02:44,730
be that genetic, or
geographic, environmental,

55
00:02:44,730 --> 00:02:46,810
historical, morphological

56
00:02:46,810 --> 00:02:49,010
to understand these complex processes

57
00:02:49,010 --> 00:02:52,900
and there are several large
data sets that she's aggregated

58
00:02:52,900 --> 00:02:55,023
over the last few years
that lend themselves

59
00:02:55,023 --> 00:02:58,520
to research in computational biology.

60
00:02:58,520 --> 00:03:00,500
Her team conducts local field work

61
00:03:00,500 --> 00:03:03,220
with the Aquatic Ecology Laboratory

62
00:03:03,220 --> 00:03:05,160
to sample water and soil samples.

63
00:03:05,160 --> 00:03:08,310
They also sequence DNA in targeted insect

64
00:03:08,310 --> 00:03:11,840
and salamander species to
estimate genetic diversity

65
00:03:11,840 --> 00:03:14,140
and dispersal for conservation purposes.

66
00:03:14,140 --> 00:03:18,810
And I encourage all of you to
check out researchgate.net,

67
00:03:18,810 --> 00:03:21,640
her papers there are quite inspiring

68
00:03:21,640 --> 00:03:24,530
and very, very interesting as a scientist.

69
00:03:24,530 --> 00:03:26,030
So, without further ado,

70
00:03:26,030 --> 00:03:27,860
please join me in welcoming Dr. Pelletier

71
00:03:27,860 --> 00:03:30,790
to the 43rd annual Darwin Festival.

72
00:03:30,790 --> 00:03:35,630
Her talk today is entitled
"Big Data for Biodiversity."

73
00:03:35,630 --> 00:03:36,463
Dr. Pelletier.

74
00:03:38,180 --> 00:03:39,890
- I'm very excited to be here

75
00:03:39,890 --> 00:03:42,150
and share my research with you.

76
00:03:42,150 --> 00:03:44,830
So, like Peter said, I'm
an evolutionary biologist

77
00:03:44,830 --> 00:03:47,870
and my main interests lie

78
00:03:47,870 --> 00:03:52,133
in understanding how biodiversity
is created and maintained.

79
00:03:53,640 --> 00:03:55,720
But before I get started with my research,

80
00:03:55,720 --> 00:03:59,730
as an alumna, I thought it would
make sense to share my path

81
00:03:59,730 --> 00:04:01,350
to how I got to where I am now,

82
00:04:01,350 --> 00:04:04,000
it's kind of a windy long path.

83
00:04:04,000 --> 00:04:06,010
So, I started at Salem State

84
00:04:06,010 --> 00:04:09,280
back when it was Salem State
College, not University.

85
00:04:09,280 --> 00:04:12,640
I wasn't even a biology
major when I started,

86
00:04:12,640 --> 00:04:14,660
but I had to take a science class,

87
00:04:14,660 --> 00:04:16,900
so I took Biology 101 and 102

88
00:04:16,900 --> 00:04:19,593
with Doctors Buttner and Cuevas Ingle.

89
00:04:20,460 --> 00:04:23,630
And that's where I just
fell in love with biology,

90
00:04:23,630 --> 00:04:26,810
more specifically, just
science in general.

91
00:04:26,810 --> 00:04:28,830
And I often like to share,

92
00:04:28,830 --> 00:04:30,760
particularly with undergraduate students,

93
00:04:30,760 --> 00:04:32,730
that I did really, really poorly

94
00:04:32,730 --> 00:04:35,230
on my first biology exam ever.

95
00:04:35,230 --> 00:04:37,400
So, if that's you, don't fret,

96
00:04:37,400 --> 00:04:39,580
you just kind of have to figure out

97
00:04:39,580 --> 00:04:41,117
how to be a college student.

98
00:04:41,117 --> 00:04:42,380
And so, I figured that out

99
00:04:42,380 --> 00:04:44,650
and I worked for several years

100
00:04:44,650 --> 00:04:46,420
at the Cat Cove Marine Laboratory

101
00:04:46,420 --> 00:04:48,070
while I was a student there,

102
00:04:48,070 --> 00:04:49,470
and have lots of fond memories

103
00:04:49,470 --> 00:04:52,593
of being out on the water
and cleaning fish tanks.

104
00:04:53,660 --> 00:04:58,140
I also did some undergraduate
research with Dr. Paul Kelly.

105
00:04:58,140 --> 00:05:01,370
So, this image here is Plethodon cinereus,

106
00:05:01,370 --> 00:05:03,840
this is the eastern red-backed salamander.

107
00:05:03,840 --> 00:05:06,070
And the goal of the
project that we conducted

108
00:05:06,070 --> 00:05:08,440
was to compare cover boards.

109
00:05:08,440 --> 00:05:11,360
So, these are just these
small squares of wood

110
00:05:11,360 --> 00:05:13,340
that we would put on the forest floor,

111
00:05:13,340 --> 00:05:15,860
and see how they compared
to natural cover,

112
00:05:15,860 --> 00:05:17,710
things like rocks and logs,

113
00:05:17,710 --> 00:05:19,030
where we would find salamanders

114
00:05:19,030 --> 00:05:22,333
as a way to monitor
salamander numbers over time.

115
00:05:23,240 --> 00:05:24,770
And I didn't really know it at the time,

116
00:05:24,770 --> 00:05:26,900
but getting to know these
Plethodon salamanders

117
00:05:26,900 --> 00:05:28,850
would play a big role in the career path

118
00:05:28,850 --> 00:05:30,860
that I ended up taking.

119
00:05:30,860 --> 00:05:34,140
But after I graduated,
I had some odd jobs,

120
00:05:34,140 --> 00:05:35,590
I worked at Eastern Mountain Sports,

121
00:05:35,590 --> 00:05:36,883
I worked at a ski resort,

122
00:05:37,930 --> 00:05:39,510
but I spent a couple summers

123
00:05:39,510 --> 00:05:42,100
working for the Northeast
Mosquito Control.

124
00:05:42,100 --> 00:05:45,270
This was with Dr. Cuevas
Ingle from Salem State.

125
00:05:45,270 --> 00:05:48,320
And we would go out
and collect mosquitoes,

126
00:05:48,320 --> 00:05:51,543
identify them, and test
them for West Nile virus.

127
00:05:52,430 --> 00:05:53,370
And I loved this job

128
00:05:53,370 --> 00:05:55,830
because I got to spend
a lot of time outside.

129
00:05:55,830 --> 00:05:57,740
And at the time,

130
00:05:57,740 --> 00:06:00,400
back then, I still thought
I was gonna be an ecologist

131
00:06:00,400 --> 00:06:04,903
or a wildlife biologist 'cause
I loved animals even insects,

132
00:06:06,680 --> 00:06:08,700
but I needed a more steady income

133
00:06:08,700 --> 00:06:11,040
so I took a job at
Charles River Laboratories,

134
00:06:11,040 --> 00:06:14,730
which is a company that
supports pharmaceutical research

135
00:06:14,730 --> 00:06:17,983
and realized that
definitely wasn't for me.

136
00:06:19,120 --> 00:06:21,920
Then I spent some time
working as a vet tech

137
00:06:21,920 --> 00:06:24,080
and thought about vet school
for a brief period of time,

138
00:06:24,080 --> 00:06:26,610
and also realized that that wasn't for me.

139
00:06:26,610 --> 00:06:30,170
So, I finally decided I
should go to grad school

140
00:06:30,170 --> 00:06:32,000
and I was living in
Portland, Oregon at the time

141
00:06:32,000 --> 00:06:34,740
so I went to Portland State University

142
00:06:34,740 --> 00:06:36,850
to talk to some biology professors there

143
00:06:36,850 --> 00:06:39,730
and figure out how to go to grad school

144
00:06:39,730 --> 00:06:41,240
and what that even meant.

145
00:06:41,240 --> 00:06:45,420
And eventually, a couple
professors agreed to co-advise me.

146
00:06:45,420 --> 00:06:47,570
I think they realized that I
was just gonna keep showing up

147
00:06:47,570 --> 00:06:50,763
and so they might as well let
me in their master's program.

148
00:06:51,690 --> 00:06:53,660
And one of my co-advisors

149
00:06:53,660 --> 00:06:56,063
asked me to take his
field herpetology class.

150
00:06:56,920 --> 00:06:59,110
It was a three-week
course in the Southwest,

151
00:06:59,110 --> 00:07:01,670
Desert Southwest before
the semester started.

152
00:07:01,670 --> 00:07:03,630
And so, before this trip

153
00:07:03,630 --> 00:07:06,590
I started reading a bunch
of scientific papers

154
00:07:06,590 --> 00:07:09,620
because I don't know, I wanted
to be able to impress him.

155
00:07:09,620 --> 00:07:13,000
And so, I was focusing mostly on papers

156
00:07:13,000 --> 00:07:14,240
on Plethodon salamanders

157
00:07:14,240 --> 00:07:17,690
'cause it was something that
I kind of felt familiar with.

158
00:07:17,690 --> 00:07:18,600
So, we're on the trip

159
00:07:18,600 --> 00:07:20,810
and he asked me what I wanted to study

160
00:07:20,810 --> 00:07:24,440
and I have no idea why
this came out of my mouth,

161
00:07:24,440 --> 00:07:27,720
but I said the population
genetics of Plethodon salamanders.

162
00:07:27,720 --> 00:07:29,610
And I'm not sure I even really knew

163
00:07:29,610 --> 00:07:30,620
what it meant at the time,

164
00:07:30,620 --> 00:07:33,530
but I didn't wanna say, "I don't know".

165
00:07:33,530 --> 00:07:35,090
And then that's what I ended up doing,

166
00:07:35,090 --> 00:07:37,780
so more specifically I studied

167
00:07:37,780 --> 00:07:40,750
the phylogeography of Plethodon vehiculum.

168
00:07:40,750 --> 00:07:43,660
So, that's in this image down here.

169
00:07:43,660 --> 00:07:46,540
This is the western red-backed salamander.

170
00:07:46,540 --> 00:07:48,230
They're not actually that closely related

171
00:07:48,230 --> 00:07:49,710
to the eastern red-backed salamander,

172
00:07:49,710 --> 00:07:52,970
they're about 40 million years
divergent from one another,

173
00:07:52,970 --> 00:07:54,910
but they both do have wide distributions,

174
00:07:54,910 --> 00:07:57,070
one on the East Coast of the United States

175
00:07:57,070 --> 00:07:58,480
and the one that I study

176
00:07:58,480 --> 00:08:00,743
is on the West Coast of the United States.

177
00:08:01,890 --> 00:08:05,480
So, I studied the
phylogeography of this species,

178
00:08:05,480 --> 00:08:07,750
what is phylogeography?

179
00:08:07,750 --> 00:08:08,583
It is the study

180
00:08:08,583 --> 00:08:12,550
of the geographic distribution
of genetic variation.

181
00:08:12,550 --> 00:08:15,630
And we can examine patterns
in genetic variation

182
00:08:15,630 --> 00:08:17,820
and that will tell us a lot of information

183
00:08:17,820 --> 00:08:21,710
about how and why biodiversity
is shaped the way that it is.

184
00:08:21,710 --> 00:08:23,640
So, it's a little bit deeper in time

185
00:08:23,640 --> 00:08:27,263
than population genetics, but
the ideas are sort of similar.

186
00:08:28,100 --> 00:08:30,460
For example, genetic data has been used

187
00:08:30,460 --> 00:08:35,030
to test the out of Africa
hypothesis in humans.

188
00:08:35,030 --> 00:08:36,180
And what we can expect

189
00:08:36,180 --> 00:08:40,380
is that in the ancestral
range of the species,

190
00:08:40,380 --> 00:08:42,750
we have high levels of genetic diversity.

191
00:08:42,750 --> 00:08:46,610
As individuals migrate out
of that ancestral range,

192
00:08:46,610 --> 00:08:47,700
they take small portions

193
00:08:47,700 --> 00:08:49,650
of that genetic diversity with them.

194
00:08:49,650 --> 00:08:52,570
And what we see is lower
levels of genetic diversity

195
00:08:52,570 --> 00:08:55,343
at the leading edge of a range expansion.

196
00:08:58,000 --> 00:09:00,100
And so, we can use this information,

197
00:09:00,100 --> 00:09:03,520
how the geographic or
how the genetic variation

198
00:09:03,520 --> 00:09:06,030
is distributed across geographic space

199
00:09:06,030 --> 00:09:09,980
in phylogeography to
identify unknown species,

200
00:09:09,980 --> 00:09:12,470
make informed conservation decisions,

201
00:09:12,470 --> 00:09:15,420
and just better understand
the processes of evolution.

202
00:09:15,420 --> 00:09:17,690
So, things like migration,

203
00:09:17,690 --> 00:09:19,653
genetic drift, and natural selection.

204
00:09:20,560 --> 00:09:22,960
So, I thought that this
was just super cool

205
00:09:22,960 --> 00:09:25,355
and I should go get a PhD

206
00:09:25,355 --> 00:09:28,200
and continue studying phylogeography.

207
00:09:28,200 --> 00:09:30,000
So, I started my PhD work

208
00:09:30,000 --> 00:09:32,520
at Louisiana State
University in Baton Rouge.

209
00:09:32,520 --> 00:09:35,070
I was working with Bryan Carstens

210
00:09:35,070 --> 00:09:37,340
who was a prominent phylogeographer

211
00:09:37,340 --> 00:09:39,930
doing work in the Pacific Northwest,

212
00:09:39,930 --> 00:09:41,960
which is where Plethodon
vehiculum is found,

213
00:09:41,960 --> 00:09:44,623
the species I studied
for my master's degree.

214
00:09:45,720 --> 00:09:48,490
So, here I studied five
different species of Plethodon.

215
00:09:48,490 --> 00:09:50,625
These images here, this is Plethodon dunni

216
00:09:50,625 --> 00:09:52,304
and Plethodon vehiculum.

217
00:09:52,304 --> 00:09:53,700
Dunni might be my favorite,

218
00:09:53,700 --> 00:09:56,053
although they all have cool features.

219
00:09:57,100 --> 00:09:59,590
I also got a field assistant,
this is my dog Shasta,

220
00:09:59,590 --> 00:10:01,860
who I never was actually able to train

221
00:10:01,860 --> 00:10:03,540
to sniff out salamanders

222
00:10:03,540 --> 00:10:06,900
so she was more of a
companion than an assistant,

223
00:10:06,900 --> 00:10:07,943
but she was great.

224
00:10:08,870 --> 00:10:10,730
But during this work,

225
00:10:10,730 --> 00:10:13,870
I started using more
sophisticated techniques

226
00:10:13,870 --> 00:10:15,520
to analyze my data.

227
00:10:15,520 --> 00:10:18,520
So, analyzing genetic data
can be really complicated,

228
00:10:18,520 --> 00:10:21,560
and in order to test hypotheses

229
00:10:21,560 --> 00:10:23,060
rather than just describe

230
00:10:23,060 --> 00:10:25,520
how the genetic variation is distributed,

231
00:10:25,520 --> 00:10:27,470
you have to be able to simulate data

232
00:10:27,470 --> 00:10:30,350
and develop models to
test those hypotheses.

233
00:10:30,350 --> 00:10:31,220
And in order to do that,

234
00:10:31,220 --> 00:10:33,140
you need to learn how to code,

235
00:10:33,140 --> 00:10:34,530
so I learned how to code.

236
00:10:34,530 --> 00:10:35,363
This is an image here

237
00:10:35,363 --> 00:10:37,700
of one of the first scripts I ever wrote,

238
00:10:37,700 --> 00:10:40,790
and it's modeling genetic
drift in a population.

239
00:10:40,790 --> 00:10:42,660
This image here is an image

240
00:10:42,660 --> 00:10:45,560
from the first chapter of my dissertation

241
00:10:45,560 --> 00:10:48,380
where I was examining demographic models

242
00:10:48,380 --> 00:10:50,470
in Plethodon idahoensis.

243
00:10:50,470 --> 00:10:52,950
So, after several years, I...

244
00:10:52,950 --> 00:10:54,070
Oh, I forgot to mention,

245
00:10:54,070 --> 00:10:55,450
I have this Ohio State here.

246
00:10:55,450 --> 00:10:57,970
About halfway through
my dissertation work,

247
00:10:57,970 --> 00:11:00,540
my advisor switched positions

248
00:11:00,540 --> 00:11:02,720
and took a job at Ohio State University

249
00:11:02,720 --> 00:11:04,680
so the lab moved up to
Ohio State with him,

250
00:11:04,680 --> 00:11:08,690
and that's where I ended
up finishing my PhD at OSU.

251
00:11:08,690 --> 00:11:11,170
And I like to joke that that was the year

252
00:11:11,170 --> 00:11:12,830
I officially became an adult

253
00:11:12,830 --> 00:11:15,800
because I finally
finished going to college

254
00:11:15,800 --> 00:11:17,040
and I got married and had a baby

255
00:11:17,040 --> 00:11:18,963
all within this six-month period.

256
00:11:19,950 --> 00:11:22,750
But I spent a few more
years there teaching

257
00:11:22,750 --> 00:11:25,850
and just doing some more
phylogeographic research

258
00:11:25,850 --> 00:11:29,830
before I ended up in the
position that I'm in now.

259
00:11:29,830 --> 00:11:31,930
So, I'm an Assistant Professor of Biology

260
00:11:31,930 --> 00:11:33,493
at Radford University.

261
00:11:35,060 --> 00:11:36,910
So, when I did my master's,

262
00:11:36,910 --> 00:11:38,500
I studied the phylogeography

263
00:11:38,500 --> 00:11:41,660
of one species of Plethodon salamander.

264
00:11:41,660 --> 00:11:43,500
Then when I did my PhD work,

265
00:11:43,500 --> 00:11:46,980
I studied several species
of Plethodon salamander.

266
00:11:46,980 --> 00:11:49,390
And one of the things that I learned

267
00:11:49,390 --> 00:11:52,290
about this was that when
you study multiple species

268
00:11:52,290 --> 00:11:55,030
and you can compare
them and contrast them,

269
00:11:55,030 --> 00:11:56,250
you can actually learn a little bit more

270
00:11:56,250 --> 00:11:57,880
about the evolutionary processes.

271
00:11:57,880 --> 00:12:01,900
So, if two species have
similar demographic history,

272
00:12:01,900 --> 00:12:04,290
similar patterns in their genetic data,

273
00:12:04,290 --> 00:12:06,410
but maybe one has a much
larger distribution,

274
00:12:06,410 --> 00:12:09,370
you can start thinking
about the physiology

275
00:12:09,370 --> 00:12:11,390
and the specific ecological niches

276
00:12:11,390 --> 00:12:14,333
that are shaping those
biodiversity patterns.

277
00:12:15,750 --> 00:12:17,700
So, me and some colleagues thought

278
00:12:17,700 --> 00:12:19,790
what if we did this even bigger, right?

279
00:12:19,790 --> 00:12:22,090
What if we asked questions like this

280
00:12:22,090 --> 00:12:25,420
in thousands of species on a global scale?

281
00:12:25,420 --> 00:12:28,500
Which really is kind of
a ridiculous undertaking,

282
00:12:28,500 --> 00:12:30,333
but we did it anyway.

283
00:12:31,680 --> 00:12:34,230
And so, what I'm gonna
share with you today,

284
00:12:34,230 --> 00:12:35,560
the first thing I'm gonna talk about

285
00:12:35,560 --> 00:12:37,080
is the data that we need.

286
00:12:37,080 --> 00:12:40,570
So, we aggregated a database
that we called phylogatR

287
00:12:40,570 --> 00:12:43,070
that puts together a lot of different

288
00:12:43,070 --> 00:12:44,500
types of phylogeographic data,

289
00:12:44,500 --> 00:12:47,130
and I'm gonna walk you
through those steps.

290
00:12:47,130 --> 00:12:48,140
Then I'm gonna describe

291
00:12:48,140 --> 00:12:50,450
a machine learning
predictive modeling technique

292
00:12:50,450 --> 00:12:52,150
that we've been using.

293
00:12:52,150 --> 00:12:54,780
When we have lots and lots of
data for thousands of species,

294
00:12:54,780 --> 00:12:57,370
it can be a little bit more
challenging to analyze,

295
00:12:57,370 --> 00:12:59,270
so we have to use these
machine learning techniques

296
00:12:59,270 --> 00:13:02,440
to take into account
data that is numerous,

297
00:13:02,440 --> 00:13:06,900
and messy, and potentially correlated.

298
00:13:06,900 --> 00:13:09,040
And then I'll share with you two examples

299
00:13:09,040 --> 00:13:10,557
about how we use these data

300
00:13:10,557 --> 00:13:12,380
and this machine learning approach

301
00:13:12,380 --> 00:13:15,140
to explore hidden diversity in mammals

302
00:13:15,140 --> 00:13:18,283
and prioritize conservation
efforts in plants.

303
00:13:21,280 --> 00:13:23,130
So first, the data.

304
00:13:23,130 --> 00:13:25,910
We live in a very data-rich age,

305
00:13:25,910 --> 00:13:28,500
we're producing data all
the time as humans, right?

306
00:13:28,500 --> 00:13:32,510
You Google something and that
information is being recorded.

307
00:13:32,510 --> 00:13:34,910
We all know that our
social media platforms

308
00:13:34,910 --> 00:13:37,100
are sharing our data with other companies

309
00:13:37,100 --> 00:13:40,483
and our Google Maps is always
keeping track of where we are.

310
00:13:41,330 --> 00:13:44,470
But if we think about like our
Google Map data, for example,

311
00:13:44,470 --> 00:13:47,570
our own individual Google
Map data is kind of boring.

312
00:13:47,570 --> 00:13:50,490
But once we put that
together in a large volume,

313
00:13:50,490 --> 00:13:51,980
it's actually really informative.

314
00:13:51,980 --> 00:13:54,100
If you see that red
line on your Google Map,

315
00:13:54,100 --> 00:13:55,600
you know that you don't
wanna take that route

316
00:13:55,600 --> 00:13:56,800
because it's super busy.

317
00:13:57,750 --> 00:14:00,250
It's the same thing in biology, right?

318
00:14:00,250 --> 00:14:02,730
If we have one data point where
we know a species is found

319
00:14:02,730 --> 00:14:05,810
in a particular geographic location,

320
00:14:05,810 --> 00:14:07,510
it doesn't really give
us a lot of information.

321
00:14:07,510 --> 00:14:10,590
But when we start putting data
together for lots of species

322
00:14:10,590 --> 00:14:12,170
and lots of geographic locations,

323
00:14:12,170 --> 00:14:13,770
we can get a lot of information.

324
00:14:14,610 --> 00:14:17,100
And biologists are out there
collecting data all the time,

325
00:14:17,100 --> 00:14:19,940
there are probably people
in the field right now

326
00:14:19,940 --> 00:14:22,620
collecting biodiversity data.

327
00:14:22,620 --> 00:14:25,780
And then they put those data
on these open source databases

328
00:14:25,780 --> 00:14:27,150
that people can use.

329
00:14:27,150 --> 00:14:28,230
So, this one here,

330
00:14:28,230 --> 00:14:31,430
this is the Global Biodiversity
Information Facility,

331
00:14:31,430 --> 00:14:32,840
GBIF for short,

332
00:14:32,840 --> 00:14:35,740
and it has almost 2
billion occurrence records.

333
00:14:35,740 --> 00:14:39,100
And an occurrence record is
basically just information

334
00:14:39,100 --> 00:14:41,890
that says this species was in,

335
00:14:41,890 --> 00:14:45,663
this was observed in this
geographic location on this day.

336
00:14:46,560 --> 00:14:49,330
And once we have that
geographic information,

337
00:14:49,330 --> 00:14:51,440
we can get other information
about the species.

338
00:14:51,440 --> 00:14:53,500
So, we can take a GPS coordinate

339
00:14:53,500 --> 00:14:57,320
and we can extract
information from data layers.

340
00:14:57,320 --> 00:15:00,382
So, we can know what the elevation is,

341
00:15:00,382 --> 00:15:02,233
we can know what the average rainfall is,

342
00:15:02,233 --> 00:15:06,103
what the range in rainfall is,
what the landscape is like.

343
00:15:07,060 --> 00:15:08,690
Is it a grassland, a city?

344
00:15:08,690 --> 00:15:10,060
So on and so forth.

345
00:15:10,060 --> 00:15:11,310
So, we can get a lot of information

346
00:15:11,310 --> 00:15:13,553
about where that species is found.

347
00:15:14,650 --> 00:15:17,560
Additionally, there are genetic databases.

348
00:15:17,560 --> 00:15:19,360
So, this one down here, BOLD,

349
00:15:19,360 --> 00:15:22,060
it's the Barcode of Life Data System

350
00:15:22,060 --> 00:15:25,980
and they have DNA sequence
data housed on this database.

351
00:15:25,980 --> 00:15:27,990
And a lot of them have GPS coordinates

352
00:15:27,990 --> 00:15:30,620
associated with those DNA sequences.

353
00:15:30,620 --> 00:15:32,510
There's also the NCBI GenBank,

354
00:15:32,510 --> 00:15:36,350
which has over 200 million DNA sequences.

355
00:15:36,350 --> 00:15:39,240
So, as phylogeographers, we are interested

356
00:15:39,240 --> 00:15:41,140
in this DNA sequence data,

357
00:15:41,140 --> 00:15:43,550
but we need to know where geographically

358
00:15:43,550 --> 00:15:45,760
that DNA sequence data came from.

359
00:15:45,760 --> 00:15:46,593
So, the problem

360
00:15:46,593 --> 00:15:49,370
is just putting all this
information together.

361
00:15:49,370 --> 00:15:53,110
So, BOLD usually does have GPS coordinates

362
00:15:53,110 --> 00:15:55,000
associated with their DNA sequences,

363
00:15:55,000 --> 00:15:57,820
but GenBank most of the time does not.

364
00:15:57,820 --> 00:16:02,180
However, GBIF, this database
that has the occurrence records,

365
00:16:02,180 --> 00:16:06,520
often has IDs that allow
you to go on GenBank

366
00:16:06,520 --> 00:16:07,820
and pull a DNA sequence

367
00:16:07,820 --> 00:16:10,823
associated with an individual
from that occurrence record.

368
00:16:12,720 --> 00:16:17,200
So, we have a pipeline that
puts together all these data.

369
00:16:17,200 --> 00:16:20,470
Right, the first step
is primary researchers

370
00:16:20,470 --> 00:16:23,470
going out in the field or
sequencing DNA in the lab

371
00:16:23,470 --> 00:16:25,430
and collecting the original data,

372
00:16:25,430 --> 00:16:27,180
putting it on these databases.

373
00:16:27,180 --> 00:16:31,473
So, our step was to
pull the data that has,

374
00:16:32,450 --> 00:16:35,350
or pull the sequence data that
has geographic coordinates

375
00:16:35,350 --> 00:16:40,090
associated with it and merge
it all together into one place.

376
00:16:40,090 --> 00:16:43,710
The next steps are to organize
it in a way that makes sense

377
00:16:43,710 --> 00:16:44,543
so that we can analyze it.

378
00:16:44,543 --> 00:16:46,370
And one of the first things we have to do

379
00:16:46,370 --> 00:16:48,800
is check species names, right?

380
00:16:48,800 --> 00:16:50,250
A lot of these data are messy,

381
00:16:50,250 --> 00:16:53,740
sometimes there are
typos in species names,

382
00:16:53,740 --> 00:16:55,480
sometimes species names change,

383
00:16:55,480 --> 00:16:58,780
so we have to standardize
the taxonomic information

384
00:16:58,780 --> 00:17:00,130
to the best of our ability.

385
00:17:01,120 --> 00:17:05,170
The next thing we need to
do is group data together

386
00:17:05,170 --> 00:17:07,510
based on like where it
comes from or what it is.

387
00:17:07,510 --> 00:17:09,420
So, when we sequence DNA,

388
00:17:09,420 --> 00:17:12,500
we're usually sequencing
several hundred base pairs

389
00:17:12,500 --> 00:17:15,650
or nucleotides or several
thousand base pairs,

390
00:17:15,650 --> 00:17:19,570
nucleotides from one
specific region of a genome.

391
00:17:19,570 --> 00:17:21,610
And so, we need to group
all the DNA sequences

392
00:17:21,610 --> 00:17:24,300
that come from the same
region of the genome together,

393
00:17:24,300 --> 00:17:25,930
and then group all the DNA sequences

394
00:17:25,930 --> 00:17:27,870
from the same species together,

395
00:17:27,870 --> 00:17:31,090
and we put those in what we
call a DNA sequence alignment.

396
00:17:31,090 --> 00:17:34,560
And so, it's sort of like a data frame

397
00:17:34,560 --> 00:17:35,800
that you might have seen before

398
00:17:35,800 --> 00:17:39,610
where on every single row we
have an individual species

399
00:17:39,610 --> 00:17:41,570
or an individual specimen,

400
00:17:41,570 --> 00:17:44,510
and then every single column
is a different nucleotide

401
00:17:44,510 --> 00:17:46,360
or position in the genome,

402
00:17:46,360 --> 00:17:49,953
and this is the format in which
we can analyze genetic data.

403
00:17:51,270 --> 00:17:52,850
There are also many other
steps we have to take

404
00:17:52,850 --> 00:17:53,683
to clean the data.

405
00:17:53,683 --> 00:17:55,610
So, like I said, these data can be messy,

406
00:17:55,610 --> 00:17:58,010
sometimes there are duplicates,
and record changes,

407
00:17:58,010 --> 00:18:00,270
and just other things
that look suspicious.

408
00:18:00,270 --> 00:18:02,990
I'm not gonna walk through the details

409
00:18:02,990 --> 00:18:05,610
because it's a little tedious
and not that exciting,

410
00:18:05,610 --> 00:18:09,600
but in this paper that we
have describing our database,

411
00:18:09,600 --> 00:18:10,940
we do have two other flow charts

412
00:18:10,940 --> 00:18:13,450
that walk through all
these steps that we made.

413
00:18:13,450 --> 00:18:15,710
So, even though I'm not
gonna go in the details,

414
00:18:15,710 --> 00:18:17,420
it's important to know

415
00:18:17,420 --> 00:18:19,620
that these data cleaning
steps are important.

416
00:18:19,620 --> 00:18:22,080
Maybe some of you have heard the term

417
00:18:22,080 --> 00:18:23,350
garbage in, garbage out.

418
00:18:23,350 --> 00:18:25,000
And basically that's saying

419
00:18:25,000 --> 00:18:28,680
if your data is bad, your
results are gonna be crap, right?

420
00:18:28,680 --> 00:18:30,180
So, we have to have good data.

421
00:18:31,110 --> 00:18:33,280
So, we take all these steps to clean it,

422
00:18:33,280 --> 00:18:36,540
and organize it, and put it in
a format that we can analyze.

423
00:18:36,540 --> 00:18:39,510
And then what we end up
with is our final data set,

424
00:18:39,510 --> 00:18:42,610
and so we have this organized
in a way that we know

425
00:18:42,610 --> 00:18:45,280
what data is where and what's in there.

426
00:18:45,280 --> 00:18:49,730
So, what we ended up with was
over 2 million DNA sequences

427
00:18:49,730 --> 00:18:52,350
that have GPS coordinates
associated with them

428
00:18:52,350 --> 00:18:55,440
for over 87,000 species.

429
00:18:55,440 --> 00:18:57,540
And you might recall that this
is actually only a fraction

430
00:18:57,540 --> 00:18:59,190
of the data that we started with, right?

431
00:18:59,190 --> 00:19:02,330
There were hundreds of millions
of DNA sequences available,

432
00:19:02,330 --> 00:19:05,280
and I'm gonna bring this up
again at the end of the talk.

433
00:19:05,280 --> 00:19:06,330
But regardless of that,

434
00:19:06,330 --> 00:19:10,270
this is still the biggest
data set of its kind

435
00:19:10,270 --> 00:19:12,590
that exists right now.

436
00:19:12,590 --> 00:19:14,140
It's so big, you could kind of imagine

437
00:19:14,140 --> 00:19:16,150
if you had an Excel spreadsheet

438
00:19:16,150 --> 00:19:20,210
with over 2.6 million rows

439
00:19:20,210 --> 00:19:22,740
and thousands and thousands
of columns, right?

440
00:19:22,740 --> 00:19:26,010
Even Excel couldn't even
open a fraction of that data,

441
00:19:26,010 --> 00:19:27,280
it would just crash on you, right?

442
00:19:27,280 --> 00:19:28,830
So, we need to have special tools

443
00:19:28,830 --> 00:19:31,080
to be able to analyze these data.

444
00:19:34,031 --> 00:19:36,010
So, that brings me to the methods

445
00:19:36,010 --> 00:19:38,638
that we're using to analyze these data.

446
00:19:38,638 --> 00:19:41,206
So, genetic data is
complicated to analyze,

447
00:19:41,206 --> 00:19:42,610
and I haven't even talked yet

448
00:19:42,610 --> 00:19:46,303
about how we bring in the
geographic and environmental data,

449
00:19:47,480 --> 00:19:49,810
but we've been using these
predictive modeling techniques

450
00:19:49,810 --> 00:19:52,380
to handle these really large data sets.

451
00:19:52,380 --> 00:19:53,560
So, the first step,

452
00:19:53,560 --> 00:19:55,350
if you were to go Google
predictive modeling,

453
00:19:55,350 --> 00:19:57,630
it often comes up as like in an industry

454
00:19:57,630 --> 00:19:59,800
let's make money perspective,

455
00:19:59,800 --> 00:20:02,293
but we can use these
same tools in biology.

456
00:20:03,450 --> 00:20:06,240
So, the first thing is
to define our problem.

457
00:20:06,240 --> 00:20:08,140
In our case broadly,

458
00:20:08,140 --> 00:20:10,620
we wanna explain biodiversity patterns

459
00:20:10,620 --> 00:20:13,050
and hopefully protect biodiversity.

460
00:20:13,050 --> 00:20:15,300
The next steps are to collect data

461
00:20:15,300 --> 00:20:18,560
and clean the data, which
we already talked about.

462
00:20:18,560 --> 00:20:20,430
Then we need to analyze our data

463
00:20:20,430 --> 00:20:21,707
and develop these predictive models.

464
00:20:21,707 --> 00:20:23,940
And these two things
kind of go hand in hand

465
00:20:23,940 --> 00:20:26,160
where we're often developing models

466
00:20:26,160 --> 00:20:28,023
that have good predictive power,

467
00:20:29,270 --> 00:20:32,210
exploring the variables that we have

468
00:20:32,210 --> 00:20:34,620
to make these predictions and
kind of going back and forth

469
00:20:34,620 --> 00:20:37,550
until we develop a useful model.

470
00:20:37,550 --> 00:20:40,370
And then we make our predictions.

471
00:20:40,370 --> 00:20:41,870
The final step would be to see

472
00:20:41,870 --> 00:20:43,770
if those predictions hold true, right?

473
00:20:43,770 --> 00:20:45,320
Did our model do a good job?

474
00:20:45,320 --> 00:20:46,700
And we're not quite there yet

475
00:20:46,700 --> 00:20:48,680
in the research that I'm doing,

476
00:20:48,680 --> 00:20:50,360
but that is something
we could keep in mind

477
00:20:50,360 --> 00:20:51,620
in years down the road

478
00:20:51,620 --> 00:20:56,120
like are these predictive
modeling techniques with big data

479
00:20:56,120 --> 00:20:59,560
actually doing the thing that
we think that they're doing?

480
00:20:59,560 --> 00:21:02,750
But in short, the way
that I like to describe

481
00:21:02,750 --> 00:21:06,110
predictive modeling is that
we are predicting the response

482
00:21:06,110 --> 00:21:08,540
for an outcome that we don't know

483
00:21:08,540 --> 00:21:12,430
based on the response of past
outcomes that we do know,

484
00:21:12,430 --> 00:21:16,810
and I'm gonna walk you through
briefly how this might look.

485
00:21:16,810 --> 00:21:19,140
So, let's say we have a data frame,

486
00:21:19,140 --> 00:21:22,623
and in this case, every single
row is a different movie,

487
00:21:23,710 --> 00:21:26,610
and this second column here
is my score of the movie.

488
00:21:26,610 --> 00:21:29,150
So, I've seen the movie,
I have a response to it,

489
00:21:29,150 --> 00:21:31,920
and I scored it from one through 10.

490
00:21:31,920 --> 00:21:34,190
So, you can see that I
really like "Star Wars"

491
00:21:34,190 --> 00:21:35,550
and "Wonder Woman",

492
00:21:35,550 --> 00:21:38,460
not as big a fan of some of
these Disney movies, right?

493
00:21:38,460 --> 00:21:42,810
But we have all these different
occurrences with a response.

494
00:21:42,810 --> 00:21:45,640
Every single one of these
movies has a set of variables,

495
00:21:45,640 --> 00:21:47,640
which we're gonna call predictor variables

496
00:21:47,640 --> 00:21:49,410
that describe those movies.

497
00:21:49,410 --> 00:21:51,230
So, who is the director?

498
00:21:51,230 --> 00:21:53,010
What is the genre of the movie?

499
00:21:53,010 --> 00:21:54,110
What's the plot?

500
00:21:54,110 --> 00:21:58,280
What are some other scores
from other organizations?

501
00:21:58,280 --> 00:22:00,550
How do they rate those movies?

502
00:22:00,550 --> 00:22:02,550
So, we have all this
information that describes

503
00:22:02,550 --> 00:22:04,240
each of the movies that I've rated.

504
00:22:04,240 --> 00:22:05,430
And we can ask a question

505
00:22:05,430 --> 00:22:08,280
like how would I score
the movie "Eternals",

506
00:22:08,280 --> 00:22:09,330
which I haven't seen,

507
00:22:09,330 --> 00:22:11,480
based on this set of predictor variables?

508
00:22:11,480 --> 00:22:12,640
And so, we have all the same

509
00:22:12,640 --> 00:22:14,250
predictor variables for "Eternals",

510
00:22:14,250 --> 00:22:17,333
the only thing we don't have is my score,

511
00:22:17,333 --> 00:22:18,570
a response variable,

512
00:22:18,570 --> 00:22:20,570
but we can predict how I might score it.

513
00:22:21,490 --> 00:22:23,520
This is the kind of thing
that Netflix does, right?

514
00:22:23,520 --> 00:22:25,770
Whenever you like movies,

515
00:22:25,770 --> 00:22:27,880
it takes all the information
about those movies

516
00:22:27,880 --> 00:22:31,070
and tries to make recommendations for you

517
00:22:31,070 --> 00:22:33,590
for movies that you might wanna see.

518
00:22:33,590 --> 00:22:36,660
So, we've been using this
random forest technique,

519
00:22:36,660 --> 00:22:38,550
it's a machine learning approach

520
00:22:38,550 --> 00:22:42,180
that uses decision trees
to predict a response.

521
00:22:42,180 --> 00:22:43,130
So, the way this works,

522
00:22:43,130 --> 00:22:46,110
this is a decision tree
here is that on every node

523
00:22:46,110 --> 00:22:50,250
in this tree is one of
our predictor variables.

524
00:22:50,250 --> 00:22:52,170
So, in this case, is it an action movie?

525
00:22:52,170 --> 00:22:54,450
And everything that is an action movie

526
00:22:54,450 --> 00:22:56,300
will go to one side of the tree

527
00:22:56,300 --> 00:22:59,060
and everything that isn't
will go to another side

528
00:22:59,060 --> 00:23:00,300
of the tree.

529
00:23:00,300 --> 00:23:02,630
Then we add in another predictor variable

530
00:23:02,630 --> 00:23:04,910
is the IMDB score greater than five?

531
00:23:04,910 --> 00:23:06,710
All those movies will go here.

532
00:23:06,710 --> 00:23:10,210
If it's less than five, all
those movies will go here.

533
00:23:10,210 --> 00:23:12,760
And then we go through all
the predictor variables

534
00:23:12,760 --> 00:23:15,840
that we have until we get
to the tips of the tree,

535
00:23:15,840 --> 00:23:18,950
and on the tips of the
trees would be my scores.

536
00:23:18,950 --> 00:23:23,060
So, what we would do is
take the unknown response,

537
00:23:23,060 --> 00:23:25,700
so in this case, the movie "Eternals",

538
00:23:25,700 --> 00:23:27,730
and we just walk it
through the tree, right?

539
00:23:27,730 --> 00:23:28,850
Is it an action movie?

540
00:23:28,850 --> 00:23:29,900
What's the IMDB score?

541
00:23:29,900 --> 00:23:31,710
And go through all those variables,

542
00:23:31,710 --> 00:23:32,960
and wherever it ends up,

543
00:23:32,960 --> 00:23:35,150
whatever tip it ends up in the tree,

544
00:23:35,150 --> 00:23:37,963
that is our prediction for my score.

545
00:23:38,920 --> 00:23:41,910
However, one decision tree by itself

546
00:23:41,910 --> 00:23:43,730
is not always a good predictor,

547
00:23:43,730 --> 00:23:47,020
and that's because depending
on the order of the variables

548
00:23:47,020 --> 00:23:48,800
that we put in the tree,

549
00:23:48,800 --> 00:23:52,020
the "Eternals" movie might
take a slightly different path.

550
00:23:52,020 --> 00:23:53,580
So, we make a forest of trees

551
00:23:53,580 --> 00:23:56,860
where the input order of
the variables is random

552
00:23:56,860 --> 00:23:58,970
and different in every single tree.

553
00:23:58,970 --> 00:24:01,470
So, we put the movie "Eternals",

554
00:24:01,470 --> 00:24:03,820
it follows a different path
in all the different trees.

555
00:24:03,820 --> 00:24:05,470
And if it consistently came out

556
00:24:05,470 --> 00:24:08,590
with a score of eight or
nine, or eight or nine,

557
00:24:08,590 --> 00:24:11,590
right, our prediction would end
up being something like 8.5.

558
00:24:12,750 --> 00:24:15,210
So, this method is great
because it's relatively quick,

559
00:24:15,210 --> 00:24:17,460
it's easy to assess model performance

560
00:24:17,460 --> 00:24:21,890
so we can see if our model

561
00:24:21,890 --> 00:24:24,350
is actually making accurate predictions,

562
00:24:24,350 --> 00:24:26,700
and it allows us to see
which predictor variables

563
00:24:26,700 --> 00:24:27,820
are the most important.

564
00:24:27,820 --> 00:24:31,460
So, for example, if movies
that I tend to score high

565
00:24:31,460 --> 00:24:34,380
always have like action and magic,

566
00:24:34,380 --> 00:24:35,960
we can pull that out of the data

567
00:24:35,960 --> 00:24:38,910
and say that that's an
important predictor variable.

568
00:24:38,910 --> 00:24:42,500
If my score doesn't generally
match up with the IMDB score,

569
00:24:42,500 --> 00:24:43,360
you know, we can say

570
00:24:43,360 --> 00:24:45,760
that that's not an important
predictor variable.

571
00:24:47,350 --> 00:24:51,423
So, how can we use these methods
to understand biodiversity?

572
00:24:52,640 --> 00:24:56,870
So, one of the most basic
areas of research in biology

573
00:24:56,870 --> 00:24:59,160
is documenting biodiversity.

574
00:24:59,160 --> 00:25:02,410
So, the Catalogue of Life is
the most comprehensive list

575
00:25:02,410 --> 00:25:05,810
of species that exist on planet Earth,

576
00:25:05,810 --> 00:25:09,320
right now it has just over
2 million species listed.

577
00:25:09,320 --> 00:25:11,390
And I really love their mission statement

578
00:25:11,390 --> 00:25:14,300
so I'm not gonna try to put
something in my own words,

579
00:25:14,300 --> 00:25:18,170
but they say, "Without accurate
documentation of species,

580
00:25:18,170 --> 00:25:21,040
we cannot sustainably
use, explore, monitor,

581
00:25:21,040 --> 00:25:24,140
manage, and protect
biodiversity resources."

582
00:25:24,140 --> 00:25:26,593
Right, and so managing and
protecting biodiversity

583
00:25:26,593 --> 00:25:30,530
is the main mission of biologists.

584
00:25:30,530 --> 00:25:33,570
And, you know, we could
argue whether biodiversity

585
00:25:33,570 --> 00:25:36,370
has innate value in and of itself.

586
00:25:36,370 --> 00:25:37,890
I would argue that it does,

587
00:25:37,890 --> 00:25:39,640
there are studies that show that people

588
00:25:39,640 --> 00:25:41,730
who spend time in nature are happier.

589
00:25:41,730 --> 00:25:44,300
But there are also a
lot of monetary benefits

590
00:25:44,300 --> 00:25:46,106
and benefits to human health, right?

591
00:25:46,106 --> 00:25:48,860
We wouldn't be sequencing human genomes,

592
00:25:48,860 --> 00:25:51,220
we wouldn't be creating
the drugs that we have

593
00:25:51,220 --> 00:25:54,280
if we didn't find those biomolecules

594
00:25:54,280 --> 00:25:57,340
that we base that stuff
off in nature first, right?

595
00:25:57,340 --> 00:26:01,730
So, we wanna document biodiversity
and hopefully protect it.

596
00:26:01,730 --> 00:26:05,580
There's this really funny xkcd
comic I came across recently,

597
00:26:05,580 --> 00:26:09,310
and the joke here is that chemistry

598
00:26:09,310 --> 00:26:10,700
is a hundred percent complete

599
00:26:10,700 --> 00:26:13,100
with the discovery of the last molecule,

600
00:26:13,100 --> 00:26:17,950
physics is 98% complete,
and biology is 93% complete.

601
00:26:17,950 --> 00:26:19,670
And the funniest part about this comic

602
00:26:19,670 --> 00:26:23,410
is that if you're on the xkcd
website and you hover over it,

603
00:26:23,410 --> 00:26:25,817
a little window pops up and it says,

604
00:26:25,817 --> 00:26:27,380
"Biology is really struggling,

605
00:26:27,380 --> 00:26:30,780
they're only at 93% and they
keep finding more ants."

606
00:26:30,780 --> 00:26:33,270
And this is so true, right?

607
00:26:33,270 --> 00:26:35,070
We're not even close to documenting

608
00:26:35,070 --> 00:26:37,090
all biodiversity that exists.

609
00:26:37,090 --> 00:26:39,700
Right now, depending on
what estimates you look at,

610
00:26:39,700 --> 00:26:42,650
we've really only described one to 10%

611
00:26:42,650 --> 00:26:46,143
of all the biodiversity
that exists on our planet.

612
00:26:47,068 --> 00:26:50,000
And so, we often think about documenting

613
00:26:50,000 --> 00:26:52,220
or discovering new biodiversity

614
00:26:52,220 --> 00:26:54,880
by going out to these remote locations

615
00:26:54,880 --> 00:26:56,140
that haven't really been explored

616
00:26:56,140 --> 00:26:58,220
and just finding brand new species.

617
00:26:58,220 --> 00:27:02,180
And that is a big part of
documenting biodiversity,

618
00:27:02,180 --> 00:27:05,520
but we also have this
problem of cryptic species.

619
00:27:05,520 --> 00:27:09,160
So, cryptic species are organisms
that appear identical,

620
00:27:09,160 --> 00:27:11,633
but are actually distinct species.

621
00:27:12,740 --> 00:27:14,870
I'm gonna just give you
a little more details

622
00:27:14,870 --> 00:27:16,010
about what this might look like.

623
00:27:16,010 --> 00:27:18,770
So, plethodontid salamanders

624
00:27:18,770 --> 00:27:20,860
are often reproductively isolated,

625
00:27:20,860 --> 00:27:23,370
so they represent different species,

626
00:27:23,370 --> 00:27:25,780
even when no obvious morphological

627
00:27:25,780 --> 00:27:28,720
or ecological differences are present.

628
00:27:28,720 --> 00:27:31,130
And this is true, not just
for plethodontid salamanders,

629
00:27:31,130 --> 00:27:35,323
but for basically all organisms
that exist on the planet.

630
00:27:36,320 --> 00:27:37,370
So, this group here,

631
00:27:37,370 --> 00:27:41,150
this is seven different
species of Plethodon salamander

632
00:27:41,150 --> 00:27:43,790
that you can find in the
Southern Appalachians,

633
00:27:43,790 --> 00:27:47,320
this represents all
their distributions here.

634
00:27:47,320 --> 00:27:49,840
But for many years they
were considered one species,

635
00:27:49,840 --> 00:27:51,940
Plethodon jordani.

636
00:27:51,940 --> 00:27:54,550
In some cases, you can see some markings

637
00:27:54,550 --> 00:27:57,150
that might indicate that
they are different species,

638
00:27:57,150 --> 00:28:01,410
but I can guarantee that
color markings on salamanders

639
00:28:01,410 --> 00:28:03,200
are not always reliable indicators

640
00:28:03,200 --> 00:28:05,490
as to whether they are
different species or not.

641
00:28:05,490 --> 00:28:08,820
There's a lot of variation as
to whether this orange patch

642
00:28:08,820 --> 00:28:10,980
shows up in this species.

643
00:28:10,980 --> 00:28:14,270
But with the technological advances

644
00:28:14,270 --> 00:28:15,810
that we've had in sequencing DNA,

645
00:28:15,810 --> 00:28:18,830
we've been able to find
these cryptic species.

646
00:28:18,830 --> 00:28:21,680
And oftentimes it's not
even the goal of a project,

647
00:28:21,680 --> 00:28:26,000
but you might sequence a bunch
of DNA from this species,

648
00:28:26,000 --> 00:28:28,110
Plethodon jordani, and
then what you end up seeing

649
00:28:28,110 --> 00:28:31,890
are these discrete genetic
units within one species,

650
00:28:31,890 --> 00:28:34,740
and it turns out they actually
are more than one species.

651
00:28:36,360 --> 00:28:40,440
So, what we wanted to do was
develop a predictive model

652
00:28:40,440 --> 00:28:43,580
to see if we could find
shared characteristics

653
00:28:43,580 --> 00:28:45,300
of mammalian clades

654
00:28:45,300 --> 00:28:47,890
that are likely to
contain cryptic species.

655
00:28:47,890 --> 00:28:49,400
So, the first thing we need to do

656
00:28:49,400 --> 00:28:51,190
is get a bunch of genetic data

657
00:28:51,190 --> 00:28:53,230
and estimate cryptic species, right?

658
00:28:53,230 --> 00:28:54,197
So, we have a named species

659
00:28:54,197 --> 00:28:57,160
and we wanna know are there
cryptic species present

660
00:28:57,160 --> 00:28:59,460
in what we think is just one species?

661
00:28:59,460 --> 00:29:00,860
And then see if we can make predictions

662
00:29:00,860 --> 00:29:02,460
about where we might find those.

663
00:29:04,530 --> 00:29:07,370
First, I have a little note
here just to remind myself

664
00:29:07,370 --> 00:29:10,980
to tell you that I'm using
the term cryptic species,

665
00:29:10,980 --> 00:29:12,810
I might also say hidden diversity,

666
00:29:12,810 --> 00:29:14,520
but I'm meaning the same thing, right?

667
00:29:14,520 --> 00:29:17,280
We're talking about potentially,

668
00:29:17,280 --> 00:29:20,710
you know, cryptic species
in a single named species

669
00:29:20,710 --> 00:29:22,710
that we've identified with genetic data.

670
00:29:23,550 --> 00:29:24,950
So, the first steps here

671
00:29:24,950 --> 00:29:26,830
are what we've already
talked about, right?

672
00:29:26,830 --> 00:29:28,860
People are going out in the
field and collecting data,

673
00:29:28,860 --> 00:29:32,200
putting it on these databases,
we're mining these data,

674
00:29:32,200 --> 00:29:36,623
aligning it, and putting it in
a format that we can analyze.

675
00:29:37,540 --> 00:29:38,960
So, what we've ended up with here

676
00:29:38,960 --> 00:29:43,350
is data for over 4,000 species of mammals,

677
00:29:43,350 --> 00:29:47,770
and this represents about
70% of named mammal species.

678
00:29:47,770 --> 00:29:51,070
We focused on two genes
that are commonly sequenced

679
00:29:51,070 --> 00:29:53,010
in phylogeographic studies,

680
00:29:53,010 --> 00:29:56,340
the cytochrome b gene and the
cytochrome oxidase 1 gene,

681
00:29:56,340 --> 00:29:57,750
but this allowed us to compare

682
00:29:57,750 --> 00:30:00,640
the highest number of species

683
00:30:00,640 --> 00:30:03,500
'cause there's a lot of
data for these two genes.

684
00:30:03,500 --> 00:30:05,660
So, if we look at this plot right here,

685
00:30:05,660 --> 00:30:09,100
this is Carnivora, what we're
seeing here, this dark bar,

686
00:30:09,100 --> 00:30:13,440
is showing that we have
cytochrome oxidase 1 sequence data

687
00:30:13,440 --> 00:30:16,788
for almost all species of Carnivora.

688
00:30:16,788 --> 00:30:18,210
And then we have cytochrome b data

689
00:30:18,210 --> 00:30:21,740
for about half the species of Carnivora.

690
00:30:21,740 --> 00:30:24,640
All the orders of mammals are
represented in our data set.

691
00:30:26,600 --> 00:30:28,010
We also see here,

692
00:30:28,010 --> 00:30:30,030
this is just the proportion of sequences

693
00:30:30,030 --> 00:30:31,930
and species diversity in the dataset.

694
00:30:31,930 --> 00:30:34,540
So, our data set consists
mostly of rodents,

695
00:30:34,540 --> 00:30:35,660
which isn't really surprising

696
00:30:35,660 --> 00:30:37,500
because there are more species of rodents

697
00:30:37,500 --> 00:30:40,600
than any other mammals followed by bats,

698
00:30:40,600 --> 00:30:43,870
ungulates, the shrews and moles,

699
00:30:43,870 --> 00:30:47,853
carnivores, primates, the
bunnies, and the opossums.

700
00:30:48,920 --> 00:30:49,830
And then this shows

701
00:30:49,830 --> 00:30:52,010
just the distribution of
the data that we have.

702
00:30:52,010 --> 00:30:55,510
So, you might notice that we
have a lot more data in Europe,

703
00:30:55,510 --> 00:30:58,090
North America, and Southeast Australia.

704
00:30:58,090 --> 00:31:00,270
And this is a common pattern
in biodiversity data,

705
00:31:00,270 --> 00:31:03,300
these are just areas that
have been well researched.

706
00:31:03,300 --> 00:31:06,830
So, we could use more of
those sort of expeditions

707
00:31:06,830 --> 00:31:09,750
where we find more species
in other parts of the world,

708
00:31:09,750 --> 00:31:13,410
but we do still have
pretty good global sampling

709
00:31:13,410 --> 00:31:14,313
in our data set.

710
00:31:16,580 --> 00:31:18,370
So then, the next step in the project

711
00:31:18,370 --> 00:31:21,080
was to estimate the hidden species

712
00:31:21,080 --> 00:31:23,253
or cryptic species that we have.

713
00:31:24,240 --> 00:31:28,070
So, we use methods of
molecular species delimitation,

714
00:31:28,070 --> 00:31:31,470
and this is basically using genetic data

715
00:31:31,470 --> 00:31:35,093
to identify species level
biological diversity.

716
00:31:36,050 --> 00:31:38,700
So, what we're seeing here
is the phylogenetic tree

717
00:31:38,700 --> 00:31:41,970
of three species of Aliatypus spiders.

718
00:31:41,970 --> 00:31:45,360
On the tips of the tree are individuals,

719
00:31:45,360 --> 00:31:46,870
so all the individuals here

720
00:31:46,870 --> 00:31:49,607
belong to this species thompsoni.

721
00:31:50,560 --> 00:31:54,380
All the individuals here belong
to the species starretti,

722
00:31:54,380 --> 00:31:58,650
and all the individuals here
belong to our roxxiae species.

723
00:31:58,650 --> 00:32:01,900
So, we can take DNA sequences
from all these individuals

724
00:32:01,900 --> 00:32:04,370
across the three species
and run them through

725
00:32:04,370 --> 00:32:08,073
our molecular species
delimitation analyses.

726
00:32:08,950 --> 00:32:11,340
In this case, we used
two different methods.

727
00:32:11,340 --> 00:32:12,780
We often use multiple methods

728
00:32:12,780 --> 00:32:15,410
because every method that we use

729
00:32:15,410 --> 00:32:17,780
has some set of assumptions,

730
00:32:17,780 --> 00:32:19,700
so we like to use multiple methods.

731
00:32:19,700 --> 00:32:22,070
What we see here in this first species,

732
00:32:22,070 --> 00:32:25,320
the species delimitation analyses match

733
00:32:25,320 --> 00:32:27,090
the current species description

734
00:32:27,090 --> 00:32:29,403
so we don't have cryptic species present.

735
00:32:30,410 --> 00:32:33,070
That's also true for this one here, right?

736
00:32:33,070 --> 00:32:34,410
The species description

737
00:32:34,410 --> 00:32:36,970
matches what we see in the genetic data.

738
00:32:36,970 --> 00:32:41,040
Alternatively, in our
third species over here,

739
00:32:41,040 --> 00:32:42,420
the genetic data suggests

740
00:32:42,420 --> 00:32:45,130
that there are actually four species

741
00:32:45,130 --> 00:32:48,850
within our currently
recognized one species.

742
00:32:48,850 --> 00:32:51,280
So, when we're running our
random forest analysis,

743
00:32:51,280 --> 00:32:53,240
we are coding our species

744
00:32:53,240 --> 00:32:55,410
as either having cryptic species or not.

745
00:32:55,410 --> 00:32:56,450
So both of these,

746
00:32:56,450 --> 00:32:58,740
we say no, we don't have
any cryptic species,

747
00:32:58,740 --> 00:33:00,960
but in this named species we do.

748
00:33:00,960 --> 00:33:02,750
So, this is gonna be our response variable

749
00:33:02,750 --> 00:33:05,180
in our random forest analysis.

750
00:33:05,180 --> 00:33:07,900
But first, I'm just
gonna summarize the data

751
00:33:07,900 --> 00:33:11,360
from the molecular species
delimitation analysis

752
00:33:11,360 --> 00:33:12,193
that we did.

753
00:33:13,490 --> 00:33:14,760
So, this is a phylogenetic tree

754
00:33:14,760 --> 00:33:17,380
that represents all the orders of mammals.

755
00:33:17,380 --> 00:33:19,570
And what we're seeing
here is the shaded region

756
00:33:19,570 --> 00:33:24,570
is the number of species
estimated using the genetic data

757
00:33:24,810 --> 00:33:27,670
relative to the number of named species,

758
00:33:27,670 --> 00:33:32,180
which is the solid figure
for all the different orders.

759
00:33:32,180 --> 00:33:33,460
And there's some variation here,

760
00:33:33,460 --> 00:33:35,390
so we have a lot of cryptic species

761
00:33:35,390 --> 00:33:39,550
in the opossums, rodents,
bunnies, the hyraxes.

762
00:33:39,550 --> 00:33:41,270
There are not a lot of cryptic species,

763
00:33:41,270 --> 00:33:44,640
or we didn't detect any cryptic
species in the sea cows,

764
00:33:44,640 --> 00:33:47,360
very few in the armadillos.

765
00:33:47,360 --> 00:33:50,370
But overall, there are
a lot of cryptic species

766
00:33:50,370 --> 00:33:51,573
present in mammals.

767
00:33:52,540 --> 00:33:56,480
So then, the next thing was to collect

768
00:33:57,370 --> 00:33:58,420
our predictor variables,

769
00:33:58,420 --> 00:34:02,290
so we have environmental,
geographic, climatic information

770
00:34:02,290 --> 00:34:05,540
that we're gonna use to
see if we can predict

771
00:34:05,540 --> 00:34:07,260
the presence of cryptic species.

772
00:34:07,260 --> 00:34:11,960
So, we use GPS coordinates to
extract things like elevation,

773
00:34:11,960 --> 00:34:14,980
average temperature, range in temperature.

774
00:34:14,980 --> 00:34:18,010
What is the relative
day to night temperature

775
00:34:18,010 --> 00:34:20,330
compared to the seasonal variation,

776
00:34:20,330 --> 00:34:22,840
all kinds of climatic variables.

777
00:34:22,840 --> 00:34:24,330
The nice thing about studying mammals

778
00:34:24,330 --> 00:34:26,070
is that there's this really large database

779
00:34:26,070 --> 00:34:28,120
that houses life history characteristics

780
00:34:28,120 --> 00:34:29,560
for much of mammals,

781
00:34:29,560 --> 00:34:33,270
so things like clutch size,
age of first reproduction,

782
00:34:33,270 --> 00:34:36,720
body mass, and several others.

783
00:34:36,720 --> 00:34:38,750
And then we also collected
geographic information,

784
00:34:38,750 --> 00:34:40,430
so what is the size of the range?

785
00:34:40,430 --> 00:34:42,170
Is it large, small?

786
00:34:42,170 --> 00:34:44,000
How close is it to the equator, right?

787
00:34:44,000 --> 00:34:46,010
So, these are our predictor variables.

788
00:34:46,010 --> 00:34:47,490
So, in our random forest analysis,

789
00:34:47,490 --> 00:34:50,390
all the things that are in the
nodes are all those climatic,

790
00:34:50,390 --> 00:34:52,770
environmental, geographic variables.

791
00:34:52,770 --> 00:34:56,110
And at our tips are all the named species,

792
00:34:56,110 --> 00:34:59,280
and there are either no
cryptic species present,

793
00:34:59,280 --> 00:35:01,160
or there are cryptic species present

794
00:35:01,160 --> 00:35:02,760
where we detected more than one species

795
00:35:02,760 --> 00:35:03,860
with the genetic data.

796
00:35:05,240 --> 00:35:07,560
So, our goal here was to see

797
00:35:07,560 --> 00:35:11,340
if we could predict hidden diversity.

798
00:35:11,340 --> 00:35:15,210
So, the first thing that we did
was pull out which variables

799
00:35:15,210 --> 00:35:17,300
were the most important in
predicting hidden diversity.

800
00:35:17,300 --> 00:35:19,140
So, our model performed well,

801
00:35:19,140 --> 00:35:21,600
we were able to make those predictions.

802
00:35:21,600 --> 00:35:23,460
So, what we have here on our x-axis

803
00:35:23,460 --> 00:35:25,210
is our variable importance.

804
00:35:25,210 --> 00:35:27,080
So, the larger the bars here,

805
00:35:27,080 --> 00:35:30,070
the more important the variables are,

806
00:35:30,070 --> 00:35:31,390
and we have them ordered.

807
00:35:31,390 --> 00:35:33,720
Our top two predictor variables

808
00:35:33,720 --> 00:35:36,140
that were the most important
in making these predictions

809
00:35:36,140 --> 00:35:38,730
were body mass and range area.

810
00:35:38,730 --> 00:35:42,310
So, on average, species
that had hidden diversity

811
00:35:42,310 --> 00:35:44,940
or cryptic species within them

812
00:35:44,940 --> 00:35:47,490
had overall smaller body size

813
00:35:47,490 --> 00:35:50,620
than those that did not
contain cryptic species.

814
00:35:50,620 --> 00:35:54,530
Alternatively, species
that had hidden diversity

815
00:35:54,530 --> 00:35:56,550
or cryptic species present

816
00:35:56,550 --> 00:35:59,043
had larger ranges than those that did not.

817
00:36:00,060 --> 00:36:03,020
These variables were
not normally distributed

818
00:36:03,020 --> 00:36:05,170
so we conducted a nonparametric test

819
00:36:05,170 --> 00:36:08,410
and the body size and range area

820
00:36:08,410 --> 00:36:10,440
are significantly different for species

821
00:36:10,440 --> 00:36:14,363
who do or do not contain cryptic species.
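
[Editor's sketch] The talk does not name the nonparametric test that was used. Assuming it was something like a Mann-Whitney U test (an assumption, not something the speaker states), the comparison of body mass between species with and without cryptic diversity could be sketched as follows. The data values are made up, and the p-value uses a normal approximation without a tie correction.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p

# Made-up log10 body masses for illustration only.
with_cryptic = [1.2, 1.5, 1.7, 2.0, 2.1]
without_cryptic = [2.4, 2.8, 3.1, 3.5, 3.9]

u_stat, p_value = mann_whitney_u(with_cryptic, without_cryptic)
```

A rank-based test like this makes no assumption that body mass or range area is normally distributed, which is the reason the speaker gives for choosing a nonparametric test.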

822
00:36:15,570 --> 00:36:17,540
So, to summarize this,

823
00:36:17,540 --> 00:36:19,840
there are hundreds of
undescribed mammal species,

824
00:36:19,840 --> 00:36:22,960
and this wasn't super surprising for us.

825
00:36:22,960 --> 00:36:27,300
There was a study done in 2007
that looked at the literature

826
00:36:27,300 --> 00:36:31,550
and made some conclusions
about cryptic species reports.

827
00:36:31,550 --> 00:36:33,260
So, what we're seeing on the x-axis here

828
00:36:33,260 --> 00:36:35,980
is just the number of species present

829
00:36:35,980 --> 00:36:38,523
in all these different metazoan taxa.

830
00:36:39,560 --> 00:36:43,200
On the y-axis here, we have
cryptic species reports.

831
00:36:43,200 --> 00:36:46,380
So basically, the higher
number of species present

832
00:36:46,380 --> 00:36:48,370
in any of these taxa

833
00:36:48,370 --> 00:36:50,700
was correlated with the number
of cryptic species reports.

834
00:36:50,700 --> 00:36:53,920
So, the more species that
existed within a group,

835
00:36:53,920 --> 00:36:55,360
the more cryptic species reports.

836
00:36:55,360 --> 00:36:57,580
So, it's a common pattern
that we see, right?

837
00:36:57,580 --> 00:36:59,090
Cryptic species are prevalent

838
00:36:59,090 --> 00:37:01,340
across all these different taxa.

839
00:37:01,340 --> 00:37:02,940
Mammals do stand out a little bit,

840
00:37:02,940 --> 00:37:06,680
they seem to have a higher
number of cryptic species reports

841
00:37:06,680 --> 00:37:08,850
per number of species in the group

842
00:37:08,850 --> 00:37:12,240
compared to some of these
other metazoan taxa,

843
00:37:12,240 --> 00:37:15,050
but this is likely due
to the fact that mammals

844
00:37:15,050 --> 00:37:16,963
are a really well studied group.

845
00:37:18,170 --> 00:37:21,330
But in addition to sort of being aware

846
00:37:21,330 --> 00:37:23,732
that there are hundreds of
undescribed mammal species,

847
00:37:23,732 --> 00:37:26,050
we now know where to look for them, right?

848
00:37:26,050 --> 00:37:29,340
So, they're most likely to
be found in small-bodied taxa

849
00:37:29,340 --> 00:37:31,380
with large ranges.

850
00:37:31,380 --> 00:37:32,470
Within these large ranges,

851
00:37:32,470 --> 00:37:34,420
there also tends to be high variability

852
00:37:34,420 --> 00:37:36,330
in temperature and precipitation.

853
00:37:36,330 --> 00:37:38,900
And both of these things make
sense if we think about it

854
00:37:38,900 --> 00:37:42,000
because small taxa are harder to find,

855
00:37:42,000 --> 00:37:44,270
and they're also gonna be harder to find

856
00:37:44,270 --> 00:37:46,470
defining morphological characteristics

857
00:37:46,470 --> 00:37:48,470
to describe species with.

858
00:37:48,470 --> 00:37:51,000
And having large ranges
with high variability

859
00:37:51,000 --> 00:37:55,690
leaves more room for
multiple ecological niches

860
00:37:55,690 --> 00:37:57,990
and local adaptation to happen.

861
00:37:57,990 --> 00:38:00,520
So, we really see this as a starting point

862
00:38:00,520 --> 00:38:01,870
for where to focus efforts

863
00:38:01,870 --> 00:38:04,290
because going out in the field
and collecting all the data

864
00:38:04,290 --> 00:38:07,220
that's necessary to describe

865
00:38:07,220 --> 00:38:09,710
and formally name a species can...

866
00:38:09,710 --> 00:38:11,020
It's a lot of effort, right?

867
00:38:11,020 --> 00:38:11,990
But now we know where to look

868
00:38:11,990 --> 00:38:14,820
and so we can make this process
a little more efficient.

869
00:38:14,820 --> 00:38:17,480
We can also use this
predictive model for species

870
00:38:17,480 --> 00:38:19,120
in which genetic data don't exist,

871
00:38:19,120 --> 00:38:22,410
so sequencing DNA is very expensive.

872
00:38:22,410 --> 00:38:23,790
So, we might have a lot of species

873
00:38:23,790 --> 00:38:25,540
that we have our predictor variables for,

874
00:38:25,540 --> 00:38:29,090
we can know a lot about the environmental

875
00:38:29,090 --> 00:38:31,070
and geographic information just by knowing

876
00:38:31,070 --> 00:38:32,570
where the species is.

877
00:38:32,570 --> 00:38:33,470
And then we can predict

878
00:38:33,470 --> 00:38:35,690
whether there might be
cryptic species in there

879
00:38:35,690 --> 00:38:37,790
without even having to
have the genetic data.

880
00:38:37,790 --> 00:38:40,100
And then we can go
collect that secondarily

881
00:38:40,100 --> 00:38:44,773
after we know that that might
be a species of interest.

882
00:38:50,670 --> 00:38:54,140
So, the second example I'm
gonna share using this approach,

883
00:38:54,140 --> 00:38:56,230
we attempted to estimate the probability

884
00:38:56,230 --> 00:38:58,720
of being at-risk for land plants.

885
00:38:58,720 --> 00:39:02,360
So, the International Union
for Conservation of Nature

886
00:39:02,360 --> 00:39:05,850
has a Red List that categorizes species

887
00:39:05,850 --> 00:39:07,860
as being threatened or not threatened.

888
00:39:07,860 --> 00:39:10,660
And this is the most
comprehensive list that exists

889
00:39:10,660 --> 00:39:13,650
for putting species into these categories.

890
00:39:13,650 --> 00:39:17,767
So, species can be labeled
as critically endangered,

891
00:39:17,767 --> 00:39:22,510
endangered, vulnerable, near
threatened, or least concern.

892
00:39:22,510 --> 00:39:27,510
However, only about 10% of
currently recognized species

893
00:39:28,120 --> 00:39:29,100
are Red Listed.

894
00:39:29,100 --> 00:39:30,340
And again, this is a method

895
00:39:30,340 --> 00:39:32,650
that it's just time
consuming and expensive.

896
00:39:32,650 --> 00:39:33,800
We need a lot of information

897
00:39:33,800 --> 00:39:37,514
to accurately categorize any species.

898
00:39:37,514 --> 00:39:39,207
So, there's only about
10% of species listed,

899
00:39:39,207 --> 00:39:41,430
and this is even lower for land plants.

900
00:39:41,430 --> 00:39:44,689
And there has been more of a push

901
00:39:44,689 --> 00:39:47,110
to focus on the conservation of plants.

902
00:39:47,110 --> 00:39:48,000
And this makes sense

903
00:39:48,000 --> 00:39:51,030
because they're often at
the base of the food chain,

904
00:39:51,030 --> 00:39:53,480
they provide habitat for
a lot of other organisms.

905
00:39:53,480 --> 00:39:56,853
And so, if we're protecting plant species

906
00:39:56,853 --> 00:39:59,040
we're secondarily gonna protect

907
00:39:59,040 --> 00:40:01,750
a lot of other species as well.

908
00:40:01,750 --> 00:40:03,470
So, since not a lot of species are listed,

909
00:40:03,470 --> 00:40:05,130
we wanted to see if we could estimate

910
00:40:05,130 --> 00:40:07,600
the probability of being at-risk

911
00:40:07,600 --> 00:40:10,600
for things that aren't actually
already on the Red List.

912
00:40:10,600 --> 00:40:12,910
So, we downloaded GPS coordinates

913
00:40:12,910 --> 00:40:14,620
for all land plants from GBIF,

914
00:40:14,620 --> 00:40:18,150
this is the database that has
all the occurrence records.

915
00:40:18,150 --> 00:40:21,810
And this was one of the
first studies that we did

916
00:40:21,810 --> 00:40:23,350
using this data science approach,

917
00:40:23,350 --> 00:40:26,890
so we didn't actually use any
genetic data for this study,

918
00:40:26,890 --> 00:40:28,680
we just used the GPS coordinates

919
00:40:29,560 --> 00:40:31,870
and the information from the Red List.

920
00:40:31,870 --> 00:40:35,230
So, we used the GPS coordinates
to extract environmental

921
00:40:35,230 --> 00:40:36,480
and geographic information

922
00:40:36,480 --> 00:40:39,960
the same way that I
talked about previously.

923
00:40:39,960 --> 00:40:41,800
So, for our random forest model,

924
00:40:41,800 --> 00:40:45,410
our response was being threatened
or non-threatened, right?

925
00:40:45,410 --> 00:40:48,340
So, species belong to any
one of these categories,

926
00:40:48,340 --> 00:40:50,280
so that's a response variable.

927
00:40:50,280 --> 00:40:53,250
And then our predictors
are things like range size,

928
00:40:53,250 --> 00:40:56,010
maximum latitude, average temperature,

929
00:40:56,010 --> 00:40:58,240
range in temperature,
isothermality, right?

930
00:40:58,240 --> 00:41:00,600
Things that describe the distribution,

931
00:41:00,600 --> 00:41:02,783
the geographic distribution
of the species.

932
00:41:04,350 --> 00:41:08,570
This map here represents
our geographic sampling,

933
00:41:08,570 --> 00:41:11,310
so the numbers are just the
number of GPS coordinates

934
00:41:11,310 --> 00:41:14,060
that we have in any particular location.

935
00:41:14,060 --> 00:41:17,200
And we ended up with over 12,000 species

936
00:41:17,200 --> 00:41:19,930
that we had the right kind of data for

937
00:41:19,930 --> 00:41:21,930
that are on the Red List.

938
00:41:21,930 --> 00:41:23,990
So, they are categorized
as being threatened

939
00:41:23,990 --> 00:41:26,250
or not threatened at some level,

940
00:41:26,250 --> 00:41:29,780
and we used those to build
the predictive model.

941
00:41:29,780 --> 00:41:33,000
And then we had data
for over 150,000 species

942
00:41:33,000 --> 00:41:38,000
that aren't listed, but we
could estimate their probability

943
00:41:38,150 --> 00:41:39,650
of being listed as endangered.

944
00:41:41,960 --> 00:41:45,310
So, we broke up the data by
continent and on each continent

945
00:41:45,310 --> 00:41:47,370
we had tens of thousands of species

946
00:41:47,370 --> 00:41:49,350
that we were making predictions for

947
00:41:49,350 --> 00:41:51,900
that didn't exist on the Red List.

948
00:41:51,900 --> 00:41:54,160
We also had a global data set

949
00:41:54,160 --> 00:41:56,300
where we included species that were found

950
00:41:56,300 --> 00:41:59,590
on more than one continent.

951
00:41:59,590 --> 00:42:01,940
And what we ended up seeing
was on every continent,

952
00:42:01,940 --> 00:42:04,770
there were over a thousand plant species

953
00:42:04,770 --> 00:42:07,270
that had a high probability
of being threatened.

954
00:42:07,270 --> 00:42:09,260
So, had a probability of belonging

955
00:42:09,260 --> 00:42:12,020
to either the critically
endangered, endangered,

956
00:42:12,020 --> 00:42:16,860
or vulnerable group with a
probability greater than .8.

957
00:42:16,860 --> 00:42:20,323
This number is even higher
if we lower that threshold.
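
[Editor's sketch] One way to read the 0.8 cutoff is that a species is flagged when its predicted probabilities for critically endangered, endangered, and vulnerable add up to more than 0.8. That reading is an assumption, as are the category keys and the placeholder species names below; the probabilities are made up.

```python
THREATENED = ("critically_endangered", "endangered", "vulnerable")

def probability_threatened(class_probs):
    """Sum the predicted probabilities of the three threatened categories."""
    return sum(class_probs.get(category, 0.0) for category in THREATENED)

def flag_high_risk(predictions, threshold=0.8):
    """Return the names of unlisted species whose threatened probability exceeds the threshold."""
    return sorted(
        name for name, probs in predictions.items()
        if probability_threatened(probs) > threshold
    )

# Made-up model output for three hypothetical unlisted species.
predictions = {
    "species_a": {"critically_endangered": 0.5, "endangered": 0.3, "vulnerable": 0.1,
                  "near_threatened": 0.05, "least_concern": 0.05},
    "species_b": {"critically_endangered": 0.3, "endangered": 0.2, "vulnerable": 0.1,
                  "near_threatened": 0.2, "least_concern": 0.2},
    "species_c": {"critically_endangered": 0.05, "endangered": 0.05, "vulnerable": 0.1,
                  "near_threatened": 0.3, "least_concern": 0.5},
}

strict = flag_high_risk(predictions)                  # default cutoff of 0.8
relaxed = flag_high_risk(predictions, threshold=0.5)  # a lower cutoff flags more species
```

Lowering the threshold can only add species to the flagged list, which matches the remark that the count goes up when the cutoff is relaxed.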

958
00:42:21,940 --> 00:42:24,790
And so, this can be
really useful for people

959
00:42:24,790 --> 00:42:27,590
who are sort of on the front
lines of conservation efforts.

960
00:42:27,590 --> 00:42:29,750
You know, people who work for
the National Forest Service

961
00:42:29,750 --> 00:42:32,640
or USGS, and I'm sure
that they will tell you

962
00:42:32,640 --> 00:42:34,960
that they are very time and money limited,

963
00:42:34,960 --> 00:42:38,890
and so this can help focus
efforts in a more efficient way.

964
00:42:38,890 --> 00:42:42,530
So now, we have a database
of over 150,000 species

965
00:42:42,530 --> 00:42:44,880
with a probability of being listed

966
00:42:44,880 --> 00:42:46,763
in any one of these categories.

967
00:42:47,840 --> 00:42:51,250
So, trillium species are
native to the United States,

968
00:42:51,250 --> 00:42:54,810
they are, a lot of them
are listed as endangered.

969
00:42:54,810 --> 00:42:56,630
And they're a very sensitive species,

970
00:42:56,630 --> 00:42:58,040
so in some parts of the country,

971
00:42:58,040 --> 00:42:59,770
it's illegal to pick these

972
00:42:59,770 --> 00:43:02,630
because even if you just take
a little piece off of a plant,

973
00:43:02,630 --> 00:43:05,970
it will often die, so it's
a very sensitive species.

974
00:43:05,970 --> 00:43:07,290
However, you can see by this list,

975
00:43:07,290 --> 00:43:12,290
a lot of trillium species
are not on the IUCN Red List.

976
00:43:12,580 --> 00:43:15,250
So, rather than having to go through

977
00:43:15,250 --> 00:43:18,100
and assess every single
one of these species,

978
00:43:18,100 --> 00:43:19,960
which can be a difficult task,

979
00:43:19,960 --> 00:43:21,500
you could scan through and say, "Okay,

980
00:43:21,500 --> 00:43:25,200
this one species in particular
has a pretty high probability

981
00:43:25,200 --> 00:43:26,860
of being critically endangered

982
00:43:26,860 --> 00:43:28,110
so we should focus our efforts

983
00:43:28,110 --> 00:43:29,997
on that one particular species."

984
00:43:31,050 --> 00:43:34,550
In addition to that, we
averaged the probability

985
00:43:34,550 --> 00:43:39,200
of being at risk and plotted
these on a global map.

986
00:43:39,200 --> 00:43:40,760
So, this gives us hotspots

987
00:43:40,760 --> 00:43:43,880
of places in need of conservation, right?

988
00:43:43,880 --> 00:43:47,580
So, these areas that have
the red and orange colors

989
00:43:47,580 --> 00:43:49,280
are areas that have a lot of species

990
00:43:49,280 --> 00:43:51,300
with a high probability of being at risk.
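
[Editor's sketch] Averaging the at-risk probability over a map can be done by binning occurrence points into grid cells and taking the mean per cell. The one-degree cell size, the record layout `(latitude, longitude, probability)`, and the values are all assumptions for illustration; the talk does not describe the actual gridding.

```python
import math
from collections import defaultdict

def cell_of(lat, lon, size_deg=1.0):
    """Map a coordinate to the index of the grid cell that contains it."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))

def mean_risk_by_cell(records, size_deg=1.0):
    """Average the at-risk probability of all records falling in each cell."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of probabilities, count]
    for lat, lon, prob in records:
        entry = totals[cell_of(lat, lon, size_deg)]
        entry[0] += prob
        entry[1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}

# Made-up records: two points in southern Florida, one in Massachusetts.
records = [
    (25.3, -80.9, 0.9),
    (25.7, -80.2, 0.7),
    (42.5, -70.9, 0.2),
]

hotspots = mean_risk_by_cell(records)
```

Cells with a high mean would be the ones drawn in red and orange on a map like the one being described.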

991
00:43:51,300 --> 00:43:53,810
And this can help us
focus not just on species,

992
00:43:53,810 --> 00:43:56,140
but geographic areas that
are in high need, right?

993
00:43:56,140 --> 00:44:01,140
So, the coast of North America,
Southeast United States,

994
00:44:01,430 --> 00:44:06,430
this western coast of Africa,
and this Indo-Pacific region.

995
00:44:06,820 --> 00:44:09,090
Right, so if we're sort
of trying to figure out

996
00:44:09,090 --> 00:44:12,180
where to allocate funds for
protecting biodiversity,

997
00:44:12,180 --> 00:44:13,710
we might say something like, "Well,

998
00:44:13,710 --> 00:44:16,160
Everglades National Park
is definitely an area

999
00:44:16,160 --> 00:44:18,090
that needs some protection

1000
00:44:18,090 --> 00:44:18,923
since we have a lot

1001
00:44:18,923 --> 00:44:21,537
of potentially endangered species there."

1002
00:44:24,200 --> 00:44:27,590
So, this work did get a little
bit of press attention.

1003
00:44:27,590 --> 00:44:29,840
And I bring that up because one,

1004
00:44:29,840 --> 00:44:32,720
I wanna convince you that
people care about biodiversity,

1005
00:44:32,720 --> 00:44:35,150
but also the subtitle here,

1006
00:44:35,150 --> 00:44:36,920
"When researchers used machine learning

1007
00:44:36,920 --> 00:44:39,870
to evaluate 150,000 plant species,

1008
00:44:39,870 --> 00:44:42,120
they found that 10% were likely to qualify

1009
00:44:42,120 --> 00:44:43,400
for the IUCN Red List."

1010
00:44:43,400 --> 00:44:46,130
And I think that subtitles like this

1011
00:44:46,130 --> 00:44:48,010
make things sound really fancy, right?

1012
00:44:48,010 --> 00:44:50,550
Machine learning, and
it's hard to understand

1013
00:44:50,550 --> 00:44:53,450
what that means, but I hope
that I broke it down enough

1014
00:44:53,450 --> 00:44:57,520
for you to show you that
these methods aren't as fancy

1015
00:44:58,490 --> 00:44:59,440
as they sound, right?

1016
00:44:59,440 --> 00:45:02,190
Computational biology
is really just the act

1017
00:45:02,190 --> 00:45:06,560
of taking big problems and
breaking them into small pieces

1018
00:45:06,560 --> 00:45:08,900
so that we can answer big questions.

1019
00:45:08,900 --> 00:45:12,470
And, you know, if someone
told me 10 years ago

1020
00:45:12,470 --> 00:45:13,730
or when I was at Salem State

1021
00:45:13,730 --> 00:45:15,870
that I would be spending
the majority of my time

1022
00:45:15,870 --> 00:45:19,230
coding in front of a computer
to understand biodiversity,

1023
00:45:19,230 --> 00:45:22,020
I would've told them that
they were absolutely insane

1024
00:45:22,020 --> 00:45:23,640
and didn't know anything about me.

1025
00:45:23,640 --> 00:45:26,340
And I bring that up because
I think it's important

1026
00:45:26,340 --> 00:45:28,540
for the undergraduates in the audience

1027
00:45:28,540 --> 00:45:31,210
to sort of be open to things

1028
00:45:31,210 --> 00:45:32,720
and know that it's okay
to change your mind

1029
00:45:32,720 --> 00:45:33,740
about what you wanna do

1030
00:45:33,740 --> 00:45:35,843
and what you end up
doing might not look like

1031
00:45:35,843 --> 00:45:37,370
what you think it's gonna look like,

1032
00:45:37,370 --> 00:45:39,083
and that's totally fine.

1033
00:45:41,120 --> 00:45:43,620
So, this is where my path took me, right?

1034
00:45:43,620 --> 00:45:47,160
Using data science approaches
to understand biodiversity.

1035
00:45:47,160 --> 00:45:50,470
And the big goal is to understand

1036
00:45:50,470 --> 00:45:52,940
the processes of evolution,

1037
00:45:52,940 --> 00:45:56,040
why biodiversity patterns
look the way they do,

1038
00:45:56,040 --> 00:45:57,530
and then also to produce data

1039
00:45:57,530 --> 00:46:02,230
that can be useful in targeting
and documenting species

1040
00:46:02,230 --> 00:46:04,253
in order to protect biodiversity.

1041
00:46:05,270 --> 00:46:08,360
The second goal is to
increase the back and forth

1042
00:46:08,360 --> 00:46:11,060
of people who are in the
field collecting the data,

1043
00:46:11,060 --> 00:46:12,930
putting the data on databases,

1044
00:46:12,930 --> 00:46:15,030
aggregating the data,
and then using the data.

1045
00:46:15,030 --> 00:46:17,880
Right, so I mentioned that
the data that we're using

1046
00:46:17,880 --> 00:46:21,300
is a very, very small proportion
of the data that exists.

1047
00:46:21,300 --> 00:46:25,070
And so, increasing data standards

1048
00:46:25,070 --> 00:46:27,680
for putting data on databases

1049
00:46:27,680 --> 00:46:30,180
could really help us
improve these databases

1050
00:46:30,180 --> 00:46:32,850
and have a lot more data that we can use.

1051
00:46:32,850 --> 00:46:33,683
And then the last goal,

1052
00:46:33,683 --> 00:46:36,440
and this is the one that I'm
kind of the most excited about

1053
00:46:36,440 --> 00:46:39,050
right now is to increase the accessibility

1054
00:46:39,050 --> 00:46:42,910
of these large databases for
researchers and instructors

1055
00:46:42,910 --> 00:46:46,240
without the resources or skills to do so.

1056
00:46:46,240 --> 00:46:49,500
So, our database is gonna be
free and open to the public

1057
00:46:49,500 --> 00:46:50,680
in the next couple of months.

1058
00:46:50,680 --> 00:46:55,000
And with that, we have
tutorials or teaching modules

1059
00:46:55,000 --> 00:46:57,187
that you can use that use the data,

1060
00:46:57,187 --> 00:46:59,910
and all of these have been
used in the classroom.

1061
00:46:59,910 --> 00:47:02,450
Right now, they're based
on R, so if you have,

1062
00:47:02,450 --> 00:47:04,790
you might need a little
bit of R experience

1063
00:47:04,790 --> 00:47:08,730
to use these as a researcher or a teacher,

1064
00:47:08,730 --> 00:47:10,840
but we are also developing Shiny R apps

1065
00:47:10,840 --> 00:47:13,150
where you can use these
data in the classroom

1066
00:47:13,150 --> 00:47:14,530
without any coding experience

1067
00:47:14,530 --> 00:47:18,030
and you can just click on
buttons to use those data.

1068
00:47:18,030 --> 00:47:20,810
And they make really good
research projects too,

1069
00:47:20,810 --> 00:47:22,880
one of my undergraduate research students

1070
00:47:22,880 --> 00:47:26,670
used this genetic diversity R tutorial

1071
00:47:26,670 --> 00:47:29,613
to conduct a research project.

1072
00:47:30,900 --> 00:47:33,090
So, I'm gonna wrap up by showing you this,

1073
00:47:33,090 --> 00:47:37,100
this map of biodiversity
hotspots in the United States.

1074
00:47:37,100 --> 00:47:38,680
I'm very lucky to be located

1075
00:47:38,680 --> 00:47:43,050
here in the southwestern part of Virginia.

1076
00:47:43,050 --> 00:47:44,410
You guys are in Massachusetts here

1077
00:47:44,410 --> 00:47:46,360
so your biodiversity isn't quite as high,

1078
00:47:46,360 --> 00:47:48,570
but it's not all gray.

1079
00:47:48,570 --> 00:47:51,050
So, I went on our phylogatR database

1080
00:47:51,050 --> 00:47:52,530
just to see what we had there.

1081
00:47:52,530 --> 00:47:54,990
And there's DNA sequence data

1082
00:47:54,990 --> 00:47:58,100
with GPS coordinates
for over 1,100 species.

1083
00:47:58,100 --> 00:48:00,220
This is the distribution of the data,

1084
00:48:00,220 --> 00:48:01,420
so the number of species,

1085
00:48:01,420 --> 00:48:04,750
in all these different phyla
that exist on the database.

1086
00:48:04,750 --> 00:48:08,090
And so, if anything that I said

1087
00:48:08,090 --> 00:48:09,500
sounds sort of interesting to you

1088
00:48:09,500 --> 00:48:10,990
and you wanted to use it in a classroom,

1089
00:48:10,990 --> 00:48:12,620
or you were an undergraduate researcher

1090
00:48:12,620 --> 00:48:15,500
and wanted to understand biodiversity,

1091
00:48:15,500 --> 00:48:16,790
these data will be available.

1092
00:48:16,790 --> 00:48:18,510
And now you have a contact

1093
00:48:18,510 --> 00:48:21,730
for someone who has directly
put this database together

1094
00:48:21,730 --> 00:48:23,480
and you can always reach out to us.

1095
00:48:24,759 --> 00:48:26,960
So with that, I'm just gonna acknowledge

1096
00:48:26,960 --> 00:48:29,150
the people who contributed to the database

1097
00:48:29,150 --> 00:48:32,163
and the two projects I
discussed with you today.

1098
00:48:33,430 --> 00:48:35,060
All this work was funded by NSF,

1099
00:48:35,060 --> 00:48:38,130
the Ohio Supercomputer
houses the database.

1100
00:48:38,130 --> 00:48:40,690
I put both my parents on here
who might be in the audience.

1101
00:48:40,690 --> 00:48:43,260
My dad helped me a lot in the field

1102
00:48:43,260 --> 00:48:44,990
when I was working on my dissertation.

1103
00:48:44,990 --> 00:48:47,890
And my mom helped me when I
first started to learn to code,

1104
00:48:47,890 --> 00:48:50,540
she works in the IT
department at Salem State.

1105
00:48:50,540 --> 00:48:53,553
And with that, I will take any questions.

1106
00:48:57,420 --> 00:49:00,500
- Dr. Pelletier, Tara, I can't...

1107
00:49:00,500 --> 00:49:03,520
I am so excited and that
was such a wonderful talk,

1108
00:49:03,520 --> 00:49:05,320
and thank you, thank you, thank you.

1109
00:49:07,420 --> 00:49:10,920
And I wanted to sort of go

1110
00:49:10,920 --> 00:49:14,500
into what kind of connections you have

1111
00:49:14,500 --> 00:49:16,560
because you mentioned
how wonderful

1112
00:49:16,560 --> 00:49:20,330
this would be for groups

1113
00:49:20,330 --> 00:49:24,720
such as the Forest Service and so forth,

1114
00:49:24,720 --> 00:49:27,220
have they actually shown
an interest in this?

1115
00:49:27,220 --> 00:49:29,320
And have you been able
to make some connections?

1116
00:49:29,320 --> 00:49:34,120
And where is that kind of avenue headed?

1117
00:49:34,120 --> 00:49:37,560
- Yeah, so this is such
an important question.

1118
00:49:37,560 --> 00:49:39,380
I actually have a really long answer,

1119
00:49:39,380 --> 00:49:41,910
so if I go too off the
rails, you can stop me.

1120
00:49:41,910 --> 00:49:45,000
But so, I actually did have
the Northwest Forest Service

1121
00:49:45,000 --> 00:49:48,290
reach out to me recently to
ask for some of my data layers,

1122
00:49:48,290 --> 00:49:49,710
which I found really encouraging

1123
00:49:49,710 --> 00:49:53,240
because I find that phylogeography

1124
00:49:53,240 --> 00:49:55,610
can be a very sort of theoretical field,

1125
00:49:55,610 --> 00:49:56,890
and we get really caught up

1126
00:49:56,890 --> 00:49:59,300
in just how do we answer these questions?

1127
00:49:59,300 --> 00:50:02,460
And don't always think about how to apply

1128
00:50:02,460 --> 00:50:04,433
the answers to those questions.

1129
00:50:06,610 --> 00:50:08,170
And I see it in these

1130
00:50:08,170 --> 00:50:10,530
molecular species delimitation
analyses all the time

1131
00:50:10,530 --> 00:50:11,960
where people will say like, "Oh,

1132
00:50:11,960 --> 00:50:15,120
there's cryptic species present here,"

1133
00:50:15,120 --> 00:50:16,700
but they don't always
take it to the next step

1134
00:50:16,700 --> 00:50:18,140
to formally describe those species.

1135
00:50:18,140 --> 00:50:19,830
And a lot of times we just
don't have the skills,

1136
00:50:19,830 --> 00:50:22,410
or time, or whatever to do it.

1137
00:50:22,410 --> 00:50:24,040
So, I am hoping that some of this work

1138
00:50:24,040 --> 00:50:25,680
will bridge those gaps a little bit more

1139
00:50:25,680 --> 00:50:27,610
and make the results that we're seeing

1140
00:50:27,610 --> 00:50:30,100
more applicable to like those people

1141
00:50:30,100 --> 00:50:31,600
who are describing species

1142
00:50:31,600 --> 00:50:33,790
and people who are
doing conservation work.

1143
00:50:33,790 --> 00:50:36,350
So, I think so, like I said,

1144
00:50:36,350 --> 00:50:39,820
the Northwest Forest
Service did reach out to me.

1145
00:50:39,820 --> 00:50:42,570
So far, that's all I've
gotten interest from,

1146
00:50:42,570 --> 00:50:45,610
but the Hidden Diversity Project

1147
00:50:45,610 --> 00:50:47,620
is actually in revision right now,

1148
00:50:47,620 --> 00:50:48,660
we're just kind of wrapping it up.

1149
00:50:48,660 --> 00:50:49,770
So, it hasn't been published yet,

1150
00:50:49,770 --> 00:50:52,020
but it probably will be
in the next couple months.

1151
00:50:52,020 --> 00:50:53,280
And the database will be available

1152
00:50:53,280 --> 00:50:55,150
in the next couple months too.

1153
00:50:55,150 --> 00:50:58,520
- I would, so as somebody who's older

1154
00:50:58,520 --> 00:51:02,040
and some of the people in
these groups might be older,

1155
00:51:02,040 --> 00:51:03,820
the fact that you're
developing in a button

1156
00:51:03,820 --> 00:51:07,900
kind of format and that kind of thing,

1157
00:51:07,900 --> 00:51:09,210
yes, it's great for teachers,

1158
00:51:09,210 --> 00:51:12,290
but it's also great for
people who are not coders,

1159
00:51:12,290 --> 00:51:13,970
and so that could help.
- Yeah, you know,

1160
00:51:13,970 --> 00:51:16,510
it depends too like what
are your learning objectives

1161
00:51:16,510 --> 00:51:17,390
for a course, right?

1162
00:51:17,390 --> 00:51:19,390
Like I like to teach coding in R

1163
00:51:19,390 --> 00:51:23,150
because I think it teaches some
like organizational skills,

1164
00:51:23,150 --> 00:51:25,230
and paying attention to
detail, and things like that.

1165
00:51:25,230 --> 00:51:27,190
But sometimes you're more
concerned about the content,

1166
00:51:27,190 --> 00:51:28,450
right, and you don't wanna get...

1167
00:51:28,450 --> 00:51:32,750
And it will take a long time
getting everyone up to speed

1168
00:51:32,750 --> 00:51:35,310
on getting R working on
your computer, right?

1169
00:51:35,310 --> 00:51:36,290
That's a whole process,

1170
00:51:36,290 --> 00:51:40,403
and so sometimes those clicky
buttons can be really helpful.

1171
00:51:41,750 --> 00:51:42,930
- Dr. Popolizio.

1172
00:51:45,580 --> 00:51:46,570
- Thank you so much, Tara.

1173
00:51:46,570 --> 00:51:47,543
That was great.

1174
00:51:48,710 --> 00:51:52,490
I have if I could share
a comment from our Q&A

1175
00:51:52,490 --> 00:51:56,190
from a colleague, Dr. David Tapley.

1176
00:51:56,190 --> 00:51:58,530
It's not a question, but
he would like to share

1177
00:51:58,530 --> 00:52:01,560
that he thinks it's one of
the best Darwin talks ever,

1178
00:52:01,560 --> 00:52:03,540
and so well done. (laughs)
(Tara laughing)

1179
00:52:03,540 --> 00:52:04,373
- Thank you.

1180
00:52:04,373 --> 00:52:06,553
- And then I have a question.

1181
00:52:07,690 --> 00:52:12,020
I actually, I'm a
biodiversity scientist myself,

1182
00:52:12,020 --> 00:52:13,500
but on a much smaller scale,

1183
00:52:13,500 --> 00:52:18,350
I study a specific group of
organisms, red algae mostly,

1184
00:52:18,350 --> 00:52:19,250
seaweeds in general,

1185
00:52:19,250 --> 00:52:20,970
but mostly tropical red algae

1186
00:52:20,970 --> 00:52:23,287
and use a lot of phylogenetics in my work.

1187
00:52:23,287 --> 00:52:24,580
And so, I was really curious

1188
00:52:24,580 --> 00:52:29,410
about the species delimitation analysis

1189
00:52:29,410 --> 00:52:34,410
that you mentioned and wondering
how those particular models

1190
00:52:34,740 --> 00:52:36,730
that you choose help you decide

1191
00:52:36,730 --> 00:52:39,233
whether there's cryptic diversity or not,

1192
00:52:40,410 --> 00:52:43,930
if there's a specific kind of
cutoff for genetic variation.

1193
00:52:43,930 --> 00:52:45,630
And since you're using,

1194
00:52:45,630 --> 00:52:47,730
since you're looking at so many different,

1195
00:52:49,240 --> 00:52:53,080
you know, categories
of organisms and taxa,

1196
00:52:53,080 --> 00:52:56,910
and that molecular evolution differs

1197
00:52:56,910 --> 00:53:00,480
so greatly between and among those taxa,

1198
00:53:00,480 --> 00:53:02,930
how those models kind of account for that.

1199
00:53:02,930 --> 00:53:06,190
- Yeah, so the approach that we took

1200
00:53:06,190 --> 00:53:08,030
specifically for that project,

1201
00:53:08,030 --> 00:53:10,790
we used two different methods
of species delimitation

1202
00:53:11,810 --> 00:53:15,130
that are both designed
for single-locus data.

1203
00:53:15,130 --> 00:53:18,660
The ABGD is the Automatic
Barcode Gap Detection,

1204
00:53:18,660 --> 00:53:21,640
and that just uses, that's
a distance-based method,

1205
00:53:21,640 --> 00:53:25,060
and so it just looks for genetic variation

1206
00:53:25,060 --> 00:53:26,480
within and between species

1207
00:53:26,480 --> 00:53:27,830
and kind of finds that biggest gap

1208
00:53:27,830 --> 00:53:30,910
and calls those two different species.

1209
00:53:30,910 --> 00:53:34,850
We also use the GMYC, the
Generalized Mixed Yule Coalescent,

1210
00:53:34,850 --> 00:53:36,240
which is a tree-based method.

1211
00:53:36,240 --> 00:53:38,940
And so, you have to estimate
a phylogenetic tree first,

1212
00:53:38,940 --> 00:53:43,940
and it looks for changes
in the branching patterns

1213
00:53:44,060 --> 00:53:49,060
from within population to
between species patterns.

1214
00:53:49,100 --> 00:53:53,180
So, that sort of addresses your question

1215
00:53:53,180 --> 00:53:55,110
about having different rates
of molecular evolution,

1216
00:53:55,110 --> 00:53:57,010
we're kind of using like
these different methods.

1217
00:53:57,010 --> 00:54:00,430
And the GMYC tends to overestimate species

1218
00:54:00,430 --> 00:54:02,860
and the ABGD tends to underestimate.

1219
00:54:02,860 --> 00:54:07,430
And so, when we did this
analysis, we only considered,

1220
00:54:07,430 --> 00:54:08,360
we only called things...

1221
00:54:08,360 --> 00:54:09,670
Actually, forgot to mention this,

1222
00:54:09,670 --> 00:54:11,400
I was gonna mention it on the one slide.

1223
00:54:11,400 --> 00:54:13,240
We only considered things cryptic

1224
00:54:13,240 --> 00:54:14,973
when the two methods matched.

1225
00:54:15,840 --> 00:54:19,660
And so, in a lot of cases
in the one figure I showed

1226
00:54:19,660 --> 00:54:20,710
with like the shadow

1227
00:54:20,710 --> 00:54:23,070
and the silhouettes of
the different organisms,

1228
00:54:23,070 --> 00:54:25,270
there were a couple that
had lines through it.

1229
00:54:25,270 --> 00:54:27,100
And those were in those groups,

1230
00:54:27,100 --> 00:54:29,337
we didn't have consistent results

1231
00:54:29,337 --> 00:54:31,737
and we didn't include
them in the random forest.

1232
00:54:33,460 --> 00:54:35,380
But I have seen in a lot
of species delimitation

1233
00:54:35,380 --> 00:54:37,040
where you do get mixed results

1234
00:54:37,040 --> 00:54:38,410
because all the different methods

1235
00:54:38,410 --> 00:54:39,410
have different assumptions.

1236
00:54:39,410 --> 00:54:40,900
And like you said, different taxa

1237
00:54:40,900 --> 00:54:44,290
have different rates of evolution.

1238
00:54:44,290 --> 00:54:47,050
So, we really kind of feel like
this is a first pass, right?

1239
00:54:47,050 --> 00:54:48,670
There's probably cryptic species

1240
00:54:48,670 --> 00:54:50,030
within that recognized species,

1241
00:54:50,030 --> 00:54:53,280
but now it's time to go back
and reexamine that species.

1242
00:54:53,280 --> 00:54:55,650
Which is why we use the word
hidden diversity sometimes

1243
00:54:55,650 --> 00:54:57,250
because calling them cryptic species

1244
00:54:57,250 --> 00:54:59,040
makes certain groups of people mad.

1245
00:54:59,040 --> 00:55:01,707
(Thea laughing)

1246
00:55:02,700 --> 00:55:03,943
Thank you.
- Mm-hmm.

1247
00:55:05,010 --> 00:55:07,320
- Dr. Pelletier, if I could ask a question

1248
00:55:07,320 --> 00:55:09,940
from the virtual room we're sitting in

1249
00:55:09,940 --> 00:55:11,560
on this side of the webinar.

1250
00:55:11,560 --> 00:55:15,760
You showed us the different
orders of Mammalia

1251
00:55:15,760 --> 00:55:20,360
and the proportion that you
guys consider hidden species.

1252
00:55:20,360 --> 00:55:21,990
And one that struck,

1253
00:55:21,990 --> 00:55:25,600
the one that had the greatest
difference were the hyraxes,

1254
00:55:25,600 --> 00:55:27,113
the rock rabbits.

1255
00:55:28,240 --> 00:55:30,920
In the next slide, you
gave some general ideas

1256
00:55:30,920 --> 00:55:32,500
why there might be the difference

1257
00:55:32,500 --> 00:55:35,270
between described versus cryptic species.

1258
00:55:35,270 --> 00:55:37,420
And I know you're a salamander person,

1259
00:55:37,420 --> 00:55:40,630
but any thoughts as to
why the rock hyraxes

1260
00:55:40,630 --> 00:55:43,380
are so potentially different

1261
00:55:43,380 --> 00:55:47,650
in terms of their hidden versus
current described diversity?

1262
00:55:47,650 --> 00:55:48,880
- You know, I was so afraid

1263
00:55:48,880 --> 00:55:51,130
someone was gonna ask me
this question. (laughs)

1264
00:55:51,130 --> 00:55:52,370
I don't know.

1265
00:55:52,370 --> 00:55:54,360
I don't know enough
about hyraxes to answer.

1266
00:55:54,360 --> 00:55:56,930
There are some mammalogists that
were a part of this project,

1267
00:55:56,930 --> 00:55:58,750
and I feel like I need to go ask them

1268
00:55:58,750 --> 00:56:00,070
specifically about that group

1269
00:56:00,070 --> 00:56:01,760
because I had the same thought

1270
00:56:01,760 --> 00:56:03,110
when I was looking through those slides.

1271
00:56:03,110 --> 00:56:04,610
And I was like, "Shoot,

1272
00:56:04,610 --> 00:56:06,900
I don't really know a
lot about this group."

1273
00:56:06,900 --> 00:56:10,310
That one group didn't have
a high number of species

1274
00:56:10,310 --> 00:56:11,683
in it to start with.

1275
00:56:14,130 --> 00:56:15,630
That's all I can say about it.

1276
00:56:16,570 --> 00:56:19,220
- The reason I ask is I
grew up in Southern Africa

1277
00:56:19,220 --> 00:56:21,670
and we would see them daily.

1278
00:56:21,670 --> 00:56:23,530
Funny little animals.
- Yeah.

1279
00:56:23,530 --> 00:56:27,323
You know, it might just
be a lack of effort,

1280
00:56:28,700 --> 00:56:30,740
not because people aren't trying, right?

1281
00:56:30,740 --> 00:56:33,050
But maybe there's just fewer funds,

1282
00:56:33,050 --> 00:56:35,210
and resources, and that kind of stuff

1283
00:56:35,210 --> 00:56:36,660
to go out in the field there.

1284
00:56:37,660 --> 00:56:38,960
- Great, thanks very much.

1285
00:56:40,330 --> 00:56:43,490
- We have another
question from the audience

1286
00:56:43,490 --> 00:56:46,510
from Dr. Buttner, who
was mentioned earlier.

1287
00:56:46,510 --> 00:56:48,790
He says, "Hello and welcome back.

1288
00:56:48,790 --> 00:56:50,700
Is the Mississippi River being examined

1289
00:56:50,700 --> 00:56:53,090
as a conduit for species diversification?

1290
00:56:53,090 --> 00:56:56,277
Lots of subtropical species
make northern incursions."

1291
00:56:57,930 --> 00:57:00,297
- Yeah, I actually have
had some colleagues

1292
00:57:01,660 --> 00:57:02,800
spend quite a bit of time.

1293
00:57:02,800 --> 00:57:06,550
There are consistent
phylogeographic breaks

1294
00:57:06,550 --> 00:57:09,400
across the Mississippi River.

1295
00:57:09,400 --> 00:57:10,410
That's, I mean, I don't think

1296
00:57:10,410 --> 00:57:12,260
I can get into much
more details than that.

1297
00:57:12,260 --> 00:57:13,850
I think, did I answer the question?

1298
00:57:13,850 --> 00:57:18,540
But I would say yes, it is a common break

1299
00:57:18,540 --> 00:57:21,620
that would likely lead to speciation.

1300
00:57:21,620 --> 00:57:23,880
- I think actually another,

1301
00:57:23,880 --> 00:57:25,640
I'm coming at this from my brother

1302
00:57:25,640 --> 00:57:27,800
used to live in Illinois,
Southern Illinois.

1303
00:57:27,800 --> 00:57:30,600
And there were lots of species of plants

1304
00:57:30,600 --> 00:57:31,970
that were in Southern Illinois

1305
00:57:31,970 --> 00:57:34,260
that shouldn't be there
'cause it's too cold,

1306
00:57:34,260 --> 00:57:36,450
but that's right along
the Mississippi River.

1307
00:57:36,450 --> 00:57:39,090
- Hmm.
- And so, could there be...

1308
00:57:39,090 --> 00:57:41,920
Are people looking at how that particular

1309
00:57:41,920 --> 00:57:45,612
large body of water could be a conduit-

1310
00:57:45,612 --> 00:57:47,770
- Oh.
- for allowing species

1311
00:57:47,770 --> 00:57:49,450
to migrate in different places?

1312
00:57:49,450 --> 00:57:51,963
- Yeah, that's a good question.

1313
00:57:55,850 --> 00:57:57,670
I think we should explore that.

1314
00:57:57,670 --> 00:58:01,643
- Okay, cool, cool. (laughs)
(Tara laughing)

1315
00:58:03,850 --> 00:58:05,970
- Do we have time for another question?

1316
00:58:05,970 --> 00:58:07,420
- Sure.
- Just (indistinct).

1317
00:58:08,410 --> 00:58:13,410
Is it possible to give a
sense of among your data

1318
00:58:13,700 --> 00:58:16,840
that you're working with kind of comparing

1319
00:58:16,840 --> 00:58:20,780
how much information
you have for terrestrial

1320
00:58:20,780 --> 00:58:22,350
versus marine ecosystems,

1321
00:58:22,350 --> 00:58:26,580
and then for microbial
versus macroscopic life?

1322
00:58:26,580 --> 00:58:30,623
- Yeah, so it is, it's
dominated by animals.

1323
00:58:32,870 --> 00:58:34,670
There are a few protists in there,

1324
00:58:34,670 --> 00:58:36,070
we don't have any data

1325
00:58:36,070 --> 00:58:40,560
for like microorganisms

1326
00:58:40,560 --> 00:58:41,960
like bacteria and that kind of stuff.

1327
00:58:41,960 --> 00:58:45,100
We've sort of, I mean, I
don't even know how to,

1328
00:58:45,100 --> 00:58:48,530
I can barely describe what a
species is in a salamander.

1329
00:58:48,530 --> 00:58:52,000
You know, so doing that in microbes

1330
00:58:52,000 --> 00:58:53,340
kind of blows my mind a little bit.

1331
00:58:53,340 --> 00:58:55,860
So, it's dominated by animals.

1332
00:58:55,860 --> 00:58:57,360
There are also a lot of plants,

1333
00:58:57,360 --> 00:58:58,770
like I said, there are a few protists,

1334
00:58:58,770 --> 00:59:00,310
but that's really all we have now.

1335
00:59:00,310 --> 00:59:02,973
And it is mostly arthropods,

1336
00:59:04,430 --> 00:59:06,383
which I guess probably
isn't that surprising.

1337
00:59:09,200 --> 00:59:13,400
- I think with that, we'll
have to call it an end.

1338
00:59:13,400 --> 00:59:15,353
Thank you so much again, Dr. Pelletier.

