1 00:00:03,200 --> 00:00:10,040 The world we live in is awash with data that comes pouring in from everywhere around us. 2 00:00:10,040 --> 00:00:14,520 On its own this data is just noise and confusion. 3 00:00:14,520 --> 00:00:22,520 To make sense of data, to find the meaning in it, we need the powerful branch of science - statistics. 4 00:00:22,520 --> 00:00:26,040 Believe me there's nothing boring about statistics. 5 00:00:26,040 --> 00:00:29,400 Especially not today when we can make the data sing. 6 00:00:29,400 --> 00:00:33,400 With statistics we can really make sense of the world. 7 00:00:33,400 --> 00:00:35,040 And there's more. 8 00:00:35,040 --> 00:00:40,440 With statistics, the data deluge, as it's being called, is leading us 9 00:00:40,440 --> 00:00:46,240 to an ever greater understanding of life on Earth and the universe beyond. 10 00:00:46,240 --> 00:00:50,760 And thanks to the incredible power of today's computers, 11 00:00:50,760 --> 00:00:57,040 it may fundamentally transform the process of scientific discovery. 12 00:00:57,040 --> 00:01:02,560 I kid you not, statistics is now the sexiest subject around. 13 00:01:23,000 --> 00:01:25,600 Did you know that there is one million boats in Sweden? 14 00:01:25,600 --> 00:01:27,960 That's one boat per nine people! 15 00:01:27,960 --> 00:01:31,080 It's the highest number of boats per person in Europe! 16 00:01:41,080 --> 00:01:45,760 Being a statistician, you don't like telling your profession at dinner parties. 17 00:01:45,760 --> 00:01:48,440 But really, statisticians shouldn't be shy 18 00:01:48,440 --> 00:01:51,320 because everyone wants to understand what's going on. 19 00:01:51,320 --> 00:01:56,480 And statistics gives us a perspective on the world we live in 20 00:01:56,480 --> 00:01:59,320 that we can't get in any other way. 21 00:02:03,520 --> 00:02:09,000 Statistics tells us whether the things we think and believe are actually true. 
22 00:02:19,960 --> 00:02:25,440 And statistics are far more useful than we usually like to admit. 23 00:02:25,440 --> 00:02:29,600 In the last recession there was this famous call-in to a talk radio station. 24 00:02:29,600 --> 00:02:37,280 The man complained, "In times like this when unemployment rates are up to 13%, income has fallen by 5%, 25 00:02:37,280 --> 00:02:41,360 "and suicide rates are climbing, and I get so angry that the government 26 00:02:41,360 --> 00:02:45,520 "is wasting money on things like collection of statistics." 27 00:02:48,240 --> 00:02:50,360 I'm not officially a statistician. 28 00:02:50,360 --> 00:02:55,280 Strictly speaking, my field is global health. 29 00:02:58,120 --> 00:03:03,280 But I got really obsessed with stats when I realised how much people 30 00:03:03,280 --> 00:03:06,240 in Sweden just don't know about the rest of the world. 31 00:03:06,240 --> 00:03:10,800 I started in our medical university, Karolinska Institutet, 32 00:03:10,800 --> 00:03:13,960 an undergraduate course called Global Health. 33 00:03:13,960 --> 00:03:17,360 These students coming to us actually have the highest grade you can get 34 00:03:17,360 --> 00:03:18,840 in the Swedish college system, 35 00:03:18,840 --> 00:03:22,040 so I thought, "Maybe they know everything I'm going to teach them." 36 00:03:22,040 --> 00:03:25,680 So I did a pre-test when they came, and one of the questions 37 00:03:25,680 --> 00:03:28,160 from which I learned a lot was this one - 38 00:03:28,160 --> 00:03:32,360 which country has the highest child mortality of these five pairs? 39 00:03:32,360 --> 00:03:34,920 I won't put you at test here, but it is Turkey 40 00:03:34,920 --> 00:03:37,000 which is highest there, Poland, 41 00:03:37,000 --> 00:03:40,760 Russia, Pakistan, and South Africa. 42 00:03:40,760 --> 00:03:43,080 And these were the result of the Swedish students. 43 00:03:43,080 --> 00:03:44,760 A 1.8 right answer out of five possible. 
44 00:03:44,760 --> 00:03:49,920 And that means there was a place for a professor of International Health and for my course. 45 00:03:49,920 --> 00:03:56,360 But one late night when I was compiling the report, I really realised my discovery. 46 00:03:56,360 --> 00:04:01,160 I had shown that Swedish top students know statistically 47 00:04:01,160 --> 00:04:04,480 significantly less about the world than the chimpanzees. 48 00:04:06,000 --> 00:04:09,840 Because the chimpanzees would score half right. 49 00:04:09,840 --> 00:04:12,320 If I gave them two bananas with Sri Lanka and Turkey, 50 00:04:12,320 --> 00:04:15,600 they would be right half of the cases, but the students are not there. 51 00:04:15,600 --> 00:04:20,200 I did also an unethical study of the professors of the Karolinska Institutet, 52 00:04:20,200 --> 00:04:25,520 that hands out the Nobel Prize for medicine, and they are on par with the chimpanzees there. 53 00:04:28,160 --> 00:04:32,680 Today there's more information accessible than ever before. 54 00:04:32,680 --> 00:04:35,760 'And I work with my team at the Gapminder Foundation 55 00:04:35,760 --> 00:04:41,600 'using new tools that help everyone make sense of the changing world. 56 00:04:41,600 --> 00:04:45,320 'We draw on the masses of data that's now freely available 57 00:04:45,320 --> 00:04:49,720 'from international institutions like the UN and the World Bank. 58 00:04:49,720 --> 00:04:53,640 'And it's become my mission to share the insights 59 00:04:53,640 --> 00:05:00,200 'from this data with anyone who'll listen, and to reveal how statistics is nothing to be frightened of.' 60 00:05:02,440 --> 00:05:05,040 I'm going to provide you a view of 61 00:05:05,040 --> 00:05:09,000 the global health situation across mankind. 62 00:05:09,000 --> 00:05:14,160 And I'm going to do that in hopefully an enjoyable way, so relax. 63 00:05:14,160 --> 00:05:17,120 So we did this software which displays it like this. 
64 00:05:17,120 --> 00:05:19,320 Every bubble here is a country - 65 00:05:19,320 --> 00:05:21,320 this is China, this is India. 66 00:05:21,320 --> 00:05:23,560 The size of the bubble is the population. 67 00:05:23,560 --> 00:05:27,600 I'm going to stage a race between this sort of yellowish Ford here 68 00:05:27,600 --> 00:05:32,760 and the red Toyota down there and the brownish Volvo. 69 00:05:32,760 --> 00:05:36,440 The Toyota has a very bad start down here, and United States, 70 00:05:36,440 --> 00:05:38,280 Ford is going off-road there, 71 00:05:38,280 --> 00:05:40,480 and the Volvo is doing quite fine, this is the war. 72 00:05:40,480 --> 00:05:43,680 The Toyota got off track, now Toyota is on the healthier side of Sweden. 73 00:05:43,680 --> 00:05:46,800 That's about where I sold the Volvo and bought the Toyota. 74 00:05:46,800 --> 00:05:47,960 AUDIENCE LAUGHS 75 00:05:47,960 --> 00:05:50,840 This is the Great Leap Forward, when China fell down. 76 00:05:50,840 --> 00:05:53,080 It was the central planning by Mao Zedong. 77 00:05:53,080 --> 00:05:56,680 China recovered and said, "Never more stupid central planning," 78 00:05:56,680 --> 00:05:57,800 but they went up here. 79 00:05:57,800 --> 00:06:02,560 No, there is one more inequity, look there - United States. 80 00:06:02,560 --> 00:06:07,480 They broke my frame. Washington DC is so rich over there, 81 00:06:07,480 --> 00:06:13,040 but they are not as healthy as Kerala in India. It's quite interesting, isn't it? 82 00:06:13,040 --> 00:06:14,600 LAUGHTER AND APPLAUSE 83 00:06:20,360 --> 00:06:25,520 Welcome to the USA, world leaders in big cars 84 00:06:25,520 --> 00:06:28,480 and free data. 85 00:06:28,480 --> 00:06:35,880 There are many here who share my vision of making public data accessible and useful for everyone. 86 00:06:35,880 --> 00:06:43,440 The city of San Francisco is in the lead, opening up its data on everything. 
87 00:06:43,440 --> 00:06:47,480 Even the police department is releasing all its crime reports. 88 00:06:47,480 --> 00:06:50,840 This official crime data has been turned 89 00:06:50,840 --> 00:06:55,960 into a wonderful interactive map by two of the city's computer whizzes. 90 00:06:55,960 --> 00:06:58,920 It's community statistics in action. 91 00:07:09,400 --> 00:07:13,320 Crimespotting is a map of crime reports from the San Francisco Police Department 92 00:07:13,320 --> 00:07:16,120 showing dots on maps for citizens to be able to see 93 00:07:16,120 --> 00:07:19,320 patterns of crime around their neighbourhoods in San Francisco. 94 00:07:19,320 --> 00:07:25,080 The map is not just about individual crimes but about broader patterns that show you where crime is 95 00:07:25,080 --> 00:07:27,760 clustered around the city, which areas have high crime, 96 00:07:27,760 --> 00:07:30,320 and which areas have relatively low crime. 97 00:07:36,840 --> 00:07:41,440 We're here at the top of Jones Street on Nob Hill... 98 00:07:42,960 --> 00:07:45,280 ..quite a nice neighbourhood. 99 00:07:45,280 --> 00:07:49,600 What the crime maps show us is the relationship between 100 00:07:49,600 --> 00:07:51,360 topography and crime. 101 00:07:51,360 --> 00:07:54,520 Basically the higher up the hill, the less crime there is. 102 00:07:56,200 --> 00:07:58,640 You cross over the border 103 00:07:58,640 --> 00:08:00,240 into the flats... 104 00:08:02,800 --> 00:08:09,240 Essentially as soon as you get into the lower lying areas of Jones Street the crime just skyrockets. 105 00:08:20,240 --> 00:08:24,160 We're here in the uptown Tenderloin district. 106 00:08:26,040 --> 00:08:30,320 It's one of the oldest and densest neighbourhoods in San Francisco. 107 00:08:30,320 --> 00:08:32,400 This is where you go to buy drugs. 108 00:08:32,400 --> 00:08:33,920 Right around here. 109 00:08:37,200 --> 00:08:41,640 We see lots of aggravated assaults, lots of auto thefts. 
110 00:08:41,640 --> 00:08:48,520 Basically a huge part of the crime that happens in the city happens in this five or six block radius. 111 00:08:55,640 --> 00:08:58,920 If you've been hearing police sirens in your neighbourhood, 112 00:08:58,920 --> 00:09:02,000 you can use the map to find out why. 113 00:09:02,000 --> 00:09:05,680 If you're out at night in an unfamiliar part of town, 114 00:09:05,680 --> 00:09:09,240 you can check the map for streets to avoid. 115 00:09:09,240 --> 00:09:12,400 If a neighbour gets burgled, you can see - 116 00:09:12,400 --> 00:09:16,520 is it a one-off or has there been a spike in local crime? 117 00:09:16,520 --> 00:09:19,480 If you commute through a neighbourhood and you're worried 118 00:09:19,480 --> 00:09:23,080 about its safety, the fact that we have the ability to turn off all 119 00:09:23,080 --> 00:09:25,360 the night-time and middle-of-the-day crimes 120 00:09:25,360 --> 00:09:28,280 and show you just the things that are happening during the commute, 121 00:09:28,280 --> 00:09:32,880 it is a statistical operation. But I think to people that are interacting with the thing 122 00:09:32,880 --> 00:09:38,000 it feels very much more like they're just sort of browsing a website or shopping on Amazon. 123 00:09:38,000 --> 00:09:43,520 They're looking at data and they don't realise they're doing statistics. 124 00:09:43,520 --> 00:09:47,840 What's most exciting for me is that public statistics 125 00:09:47,840 --> 00:09:52,640 is making citizens more powerful and the authorities more accountable. 
126 00:10:02,360 --> 00:10:04,760 We have community meetings that the police attend 127 00:10:04,760 --> 00:10:08,880 and what citizens are now doing are bringing printouts 128 00:10:08,880 --> 00:10:12,240 of the maps that show where crimes are taking place, 129 00:10:12,240 --> 00:10:16,120 and they're demanding services from the police department 130 00:10:16,120 --> 00:10:20,520 and the police department is now having to change how they police, 131 00:10:20,520 --> 00:10:22,960 how they provide policing services, 132 00:10:22,960 --> 00:10:27,040 because the data is showing what is working and what is not. 133 00:10:28,560 --> 00:10:31,960 People in San Francisco are also using public data 134 00:10:31,960 --> 00:10:35,800 to map social inequalities and see how to improve society. 135 00:10:35,800 --> 00:10:39,720 And the possibilities are endless. 136 00:10:39,720 --> 00:10:43,160 I think our dream government data analysis project 137 00:10:43,160 --> 00:10:46,240 would really be focused on live information, 138 00:10:46,240 --> 00:10:51,240 on stuff that was being reported and pushed out to the world over the internet as it was happening. 139 00:10:51,240 --> 00:10:55,040 You know, trash pickups, traffic accidents, buses, 140 00:10:55,040 --> 00:10:57,680 and I think through the kind of stats-gathering power 141 00:10:57,680 --> 00:11:02,520 of the internet it's possible to really begin to see the workings of the city 142 00:11:02,520 --> 00:11:04,760 displayed as a unified interface. 143 00:11:07,320 --> 00:11:09,960 So that's where we are heading. 144 00:11:09,960 --> 00:11:14,760 Towards a world of free data with all the statistical insights that come from it, 145 00:11:14,760 --> 00:11:21,800 accessible to everyone, empowering us as citizens and letting us hold our rulers to account. 146 00:11:21,800 --> 00:11:26,920 It's a long way from where statistics began. 
147 00:11:26,920 --> 00:11:32,880 Statistics are essential to us to monitor our governments and our societies. 148 00:11:32,880 --> 00:11:36,760 But it was our rulers up there who started 149 00:11:36,760 --> 00:11:40,840 the collection of statistics in the first place in order to monitor us! 150 00:11:46,880 --> 00:11:51,440 In fact the word 'statistics' comes from 'the state'. 151 00:11:51,440 --> 00:11:55,600 Modern statistics began two centuries ago. 152 00:11:55,600 --> 00:11:59,080 Once it got going, it spread and never stopped. 153 00:11:59,080 --> 00:12:01,640 And guess who was first! 154 00:12:03,280 --> 00:12:07,560 The Chinese have Confucius, the Italians have da Vinci, 155 00:12:07,560 --> 00:12:10,240 and the British have Shakespeare. 156 00:12:10,240 --> 00:12:12,440 And we have the Tabellverket - 157 00:12:12,440 --> 00:12:16,400 the first ever systematic collection of statistics! 158 00:12:16,400 --> 00:12:21,640 Since the year 1749 we have collected data 159 00:12:21,640 --> 00:12:26,920 on every birth, marriage and death, and we are proud of it! 160 00:12:29,120 --> 00:12:32,000 The Tabellverket recorded information 161 00:12:32,000 --> 00:12:34,040 from every parish in Sweden. 162 00:12:34,040 --> 00:12:39,080 It was a huge quantity of data and it was the first time any government 163 00:12:39,080 --> 00:12:41,800 could get an accurate picture of its people. 164 00:12:49,360 --> 00:12:53,360 Sweden had been the greatest military power in Northern Europe, 165 00:12:53,360 --> 00:12:58,200 but by 1749 our star was really fading 166 00:12:58,200 --> 00:13:00,920 and other countries were growing stronger. 167 00:13:00,920 --> 00:13:03,600 At least we were a large power, 168 00:13:03,600 --> 00:13:09,960 thought to have 20 million people, enough to rival Britain and France. 169 00:13:13,400 --> 00:13:18,160 But we were in for a nasty surprise. 
170 00:13:18,160 --> 00:13:20,680 The first analysis of the Tabellverket 171 00:13:20,680 --> 00:13:24,080 revealed that Sweden only had two million inhabitants. 172 00:13:24,080 --> 00:13:30,680 Sweden was not just a power in decline, it also had a very small population. 173 00:13:30,680 --> 00:13:36,080 The government was horrified by this finding - what if the enemy found out? 174 00:13:37,840 --> 00:13:44,560 But the Tabellverket also showed that many women died in childbirth and many children died young. 175 00:13:44,560 --> 00:13:48,640 So government took action to improve the health of the people. 176 00:13:48,640 --> 00:13:52,440 This was the beginning of modern Sweden. 177 00:13:53,960 --> 00:13:59,000 It took more than 50 years before the Austrians, Belgians, Danes, 178 00:13:59,000 --> 00:14:02,320 Dutch, French, Germans, Italians 179 00:14:02,320 --> 00:14:08,600 and, finally, the British, caught up with Sweden in collecting and using statistics. 180 00:14:24,640 --> 00:14:29,640 It was called political arithmetic. It was a lovely phrase that was used for statistics. 181 00:14:29,640 --> 00:14:33,160 Governments could have much more control and understanding of 182 00:14:33,160 --> 00:14:36,840 the society - how it was working, how it was developing 183 00:14:36,840 --> 00:14:40,240 and essentially so they could control it better. 184 00:14:43,360 --> 00:14:47,960 It wasn't just governments who woke up to the power of statistics. 185 00:14:47,960 --> 00:14:54,600 Right across Europe, 19th century society went mad for facts. 186 00:14:54,600 --> 00:14:57,600 And, despite its late start, Britain, 187 00:14:57,600 --> 00:15:01,400 with its Royal Statistical Society in London, 188 00:15:01,400 --> 00:15:04,000 was soon a statisticians' nirvana. 189 00:15:05,920 --> 00:15:09,960 I love looking at old copies of the Royal Statistical Society journal 190 00:15:09,960 --> 00:15:11,760 because it's full of such odd stuff. 
191 00:15:11,760 --> 00:15:14,840 There's a wonderful paper from the 1840s 192 00:15:14,840 --> 00:15:19,200 which shows a map of England and the rates of bastardy in each county. 193 00:15:19,200 --> 00:15:23,560 So you can identify very quickly the areas with high rates of bastardy. 194 00:15:23,560 --> 00:15:27,240 Being in East Anglia it always makes me slightly laugh that Norfolk 195 00:15:27,240 --> 00:15:30,720 seems to top the "bastardy league" in the 1840s. 196 00:15:30,720 --> 00:15:36,800 One of the founders of the Royal Statistical Society 197 00:15:36,800 --> 00:15:42,120 was the great Victorian mathematician and inventor Charles Babbage. 198 00:15:42,120 --> 00:15:50,120 In 1842 he read the latest poem by an equally great Victorian, Alfred Tennyson. 199 00:15:50,120 --> 00:15:53,120 Vision of Sin contained the lines: 200 00:15:53,120 --> 00:15:55,800 "Fill the cup, and fill the can 201 00:15:55,800 --> 00:15:58,160 "Have a rouse before the morn 202 00:15:58,160 --> 00:16:03,720 "Every moment dies a man Every moment one is born." 203 00:16:03,720 --> 00:16:07,360 So keen a statistician was Babbage that he could not contain himself. 204 00:16:07,360 --> 00:16:09,360 He dashed off a letter to Tennyson 205 00:16:09,360 --> 00:16:12,200 explaining that because of population growth, 206 00:16:12,200 --> 00:16:13,640 the line should read, 207 00:16:13,640 --> 00:16:18,640 "Every moment dies a man and one and a 16th is born." 208 00:16:18,640 --> 00:16:22,480 I may add that the exact figure is 1.067, 209 00:16:22,480 --> 00:16:27,200 but something must be conceded to the laws of metre. 210 00:16:31,840 --> 00:16:36,640 In the 19th century, scholars all over Europe did amazing work 211 00:16:36,640 --> 00:16:39,000 in measuring their societies. 212 00:16:39,000 --> 00:16:42,600 They were hoovering up data on almost everything. 213 00:16:42,600 --> 00:16:46,040 But numbers alone don't tell you anything. 
214 00:16:46,040 --> 00:16:51,320 You have to analyse them, and that's what makes statistics. 215 00:16:55,760 --> 00:16:59,200 When the first statisticians began to get to grips with 216 00:16:59,200 --> 00:17:00,400 analysing their data 217 00:17:00,400 --> 00:17:05,760 they seized upon the average, and they took the average of everything. 218 00:17:09,720 --> 00:17:13,760 What's so great about an average is that 219 00:17:13,760 --> 00:17:18,640 you can take a whole mass of data and reduce it to a single number. 220 00:17:21,880 --> 00:17:26,120 And though each of us is unique, our collective lives produce 221 00:17:26,120 --> 00:17:29,880 averages that can characterise whole populations. 222 00:17:41,280 --> 00:17:45,360 I looked in my local newspaper one week and saw a pensioner 223 00:17:45,360 --> 00:17:49,440 had accidentally put her foot on the accelerator 224 00:17:49,440 --> 00:17:52,560 and crushed her friend against a wall. 225 00:17:52,560 --> 00:17:56,360 Devastating, hideous, horrible thing to happen. 226 00:17:56,360 --> 00:18:01,400 And then there was a second one about a young man who didn't have 227 00:18:01,400 --> 00:18:07,040 a driving licence, was driving a car under the influence of drugs and alcohol 228 00:18:07,040 --> 00:18:10,320 and he bashed into a pedestrian and killed him. 229 00:18:10,320 --> 00:18:15,560 What's remarkable, absolutely remarkable, if you look at the number 230 00:18:15,560 --> 00:18:22,880 of people who die each year in traffic crashes, it's nearly a constant. 231 00:18:22,880 --> 00:18:24,480 What? 232 00:18:24,480 --> 00:18:31,680 All these individual events, somehow when you sum them all up there's the same number every year. 233 00:18:31,680 --> 00:18:35,080 And every year, two and a half times as many men 234 00:18:35,080 --> 00:18:38,880 die in traffic crashes as women, and it's a constant. 235 00:18:38,880 --> 00:18:44,320 And every year the rate in Belgium is double the rate in England. 
236 00:18:44,320 --> 00:18:47,160 There are these remarkable regularities. 237 00:18:47,160 --> 00:18:54,800 So that these individual particular events sum up into a social phenomenon. 238 00:18:56,560 --> 00:18:58,120 Let's see what Sweden have done. 239 00:18:58,120 --> 00:19:01,560 We used to boast about fast social progress, that's where we were.... 240 00:19:01,560 --> 00:19:05,240 'In my lectures, to tell stories about the changing world, 241 00:19:05,240 --> 00:19:08,120 'I use the averages from entire countries, 242 00:19:08,120 --> 00:19:12,160 'whether the average of income, child mortality, family size 243 00:19:12,160 --> 00:19:13,360 'or carbon output.' 244 00:19:13,360 --> 00:19:16,200 OK, I give you Singapore. The year I was born, 245 00:19:16,200 --> 00:19:20,560 Singapore had twice the child mortality of Sweden, the most tropical country in the world, 246 00:19:20,560 --> 00:19:22,920 a marshland on the Equator, and here we go. 247 00:19:22,920 --> 00:19:25,160 It took a little time for them to get independent, 248 00:19:25,160 --> 00:19:27,160 but then they started to grow their economy, 249 00:19:27,160 --> 00:19:29,840 and they made the social investment, they got away malaria, 250 00:19:29,840 --> 00:19:33,360 they got a magnificent health system that beat both US and Sweden. 251 00:19:33,360 --> 00:19:37,600 We never thought it would happen that they would win over Sweden! 252 00:19:37,600 --> 00:19:40,520 LAUGHTER AND APPLAUSE 253 00:19:40,520 --> 00:19:46,400 But useful as averages are, they don't tell you the whole story. 254 00:19:48,800 --> 00:19:53,040 On average, Swedish people have slightly less than two legs. 255 00:19:53,040 --> 00:19:57,560 This is because few people only have one leg or no legs, 256 00:19:57,560 --> 00:19:59,760 and no-one has three legs. 257 00:19:59,760 --> 00:20:06,240 So almost everybody in Sweden has more than the average number of legs. 
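The point about legs can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the population of 1,000 people and the leg counts in it are invented purely for illustration and are not Swedish data.

```python
from statistics import mean, median

# Hypothetical population of 1,000 people (illustrative numbers, not real data):
# most have two legs, a few have one, and one person has none.
legs = [2] * 995 + [1] * 4 + [0] * 1

average = mean(legs)      # pulled just below 2 by the few low values
typical = median(legs)    # the middle value is still 2

# Share of people whose leg count is above the average
share_above = sum(1 for n in legs if n > average) / len(legs)

print(f"mean = {average}, median = {typical}, share above mean = {share_above:.1%}")
```

Because nobody has three legs to balance out the few people with fewer than two, the mean sits slightly under 2 while the median stays at 2, which is why almost everyone ends up above the average.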
258 00:20:06,240 --> 00:20:10,840 The variation in data is just as important as the average. 259 00:20:16,800 --> 00:20:19,400 But how do you get a handle on variation? 260 00:20:19,400 --> 00:20:23,000 For this, you transform numbers into shapes. 261 00:20:23,000 --> 00:20:26,320 Let's look again at the number of adult women in Sweden 262 00:20:26,320 --> 00:20:27,800 for different heights. 263 00:20:27,800 --> 00:20:31,800 Plotting the data as a shape shows how much their heights 264 00:20:31,800 --> 00:20:36,400 vary from the average and how wide that variation is. 265 00:20:36,400 --> 00:20:41,520 The shape a set of data makes is called its distribution. 266 00:20:41,520 --> 00:20:46,080 This is the income distribution of China, 1970. 267 00:20:46,080 --> 00:20:51,000 This is the income distribution of the United States, 1970. 268 00:20:51,000 --> 00:20:54,080 Almost no overlap, and what has happened? 269 00:20:54,080 --> 00:20:56,880 China is growing, it's not so equal any longer, 270 00:20:56,880 --> 00:21:01,120 and it's appearing here overlooking the United States. 271 00:21:01,120 --> 00:21:03,480 Almost like a ghost, isn't it? 272 00:21:03,480 --> 00:21:05,160 It's pretty scary. 273 00:21:05,160 --> 00:21:06,680 Rrrr! 274 00:21:06,680 --> 00:21:08,200 LAUGHTER 275 00:21:17,160 --> 00:21:21,280 The statisticians who first explored distribution 276 00:21:21,280 --> 00:21:25,760 discovered one shape that turned up again and again. 277 00:21:25,760 --> 00:21:28,120 The Victorian scholar Francis Galton 278 00:21:28,120 --> 00:21:32,400 was so fascinated he built a machine that could reproduce it, 279 00:21:32,400 --> 00:21:36,080 and he found it fitted so many different sets of measurements 280 00:21:36,080 --> 00:21:38,640 that he named it the normal distribution. 
281 00:21:38,640 --> 00:21:45,600 Whether it was people's arm spans, lung capacities, 282 00:21:45,600 --> 00:21:47,400 or even their exam results, 283 00:21:47,400 --> 00:21:51,360 the normal distribution shape recurred time and time again. 284 00:21:51,360 --> 00:21:56,360 Other statisticians soon found many other regular shapes, 285 00:21:56,360 --> 00:22:01,360 each produced by particular kinds of natural or social processes. 286 00:22:01,360 --> 00:22:05,400 And every statistician has their favourite. 287 00:22:05,400 --> 00:22:09,280 The Poisson distribution, the Poisson shape is my favourite distribution. 288 00:22:09,280 --> 00:22:11,120 I think it's an absolute cracker. 289 00:22:15,760 --> 00:22:18,720 The Poisson shape describes how likely it is 290 00:22:18,720 --> 00:22:21,680 that out-of-the-ordinary things will happen. 291 00:22:21,680 --> 00:22:24,520 Imagine a London bus stop where we know that on average 292 00:22:24,520 --> 00:22:26,280 we'll get three buses in an hour. 293 00:22:26,280 --> 00:22:29,280 We won't always get three buses, of course. 294 00:22:29,280 --> 00:22:33,480 Amazingly, the Poisson shape will show us the probability 295 00:22:33,480 --> 00:22:37,200 that in any given hour we will get four, five, or six buses, 296 00:22:37,200 --> 00:22:39,440 or no buses at all. 297 00:22:40,720 --> 00:22:43,480 The exact shape changes with the average. 298 00:22:43,480 --> 00:22:46,920 But whether it's how many people will win the lottery jackpot 299 00:22:46,920 --> 00:22:48,000 each week, 300 00:22:48,000 --> 00:22:51,200 or how many people will phone a call centre each minute, 301 00:22:51,200 --> 00:22:54,120 the Poisson shape will give the probabilities. 
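The bus-stop example can be worked through numerically. The sketch below uses the standard Poisson formula, P(k) = mean^k * e^(-mean) / k!, with the average of three buses per hour taken from the narration; the function name `poisson_pmf` and the constant name are our own, chosen for illustration.

```python
from math import exp, factorial

def poisson_pmf(k: int, mean_rate: float) -> float:
    """Probability of exactly k events when the long-run average is mean_rate."""
    return mean_rate ** k * exp(-mean_rate) / factorial(k)

MEAN_BUSES_PER_HOUR = 3  # the average quoted in the narration

# Probability of seeing 0, 1, ..., 6 buses in any given hour
for buses in range(7):
    probability = poisson_pmf(buses, MEAN_BUSES_PER_HOUR)
    print(f"{buses} buses: {probability:.3f}")
```

With an average of three, an hour with no buses at all has probability e^(-3), which is roughly one hour in twenty. Changing `MEAN_BUSES_PER_HOUR` changes the shape, as the narration says.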
302 00:22:57,240 --> 00:23:01,240 The wonderful example where this was applied to in the late 19th century 303 00:23:01,240 --> 00:23:04,400 was to count each year the number of Prussian officers, 304 00:23:04,400 --> 00:23:07,520 cavalry officers, who were kicked to death by their horses. 305 00:23:07,520 --> 00:23:10,240 Now, some years there were none, some years there were one, 306 00:23:10,240 --> 00:23:13,880 some years there were two, up to seven, I think, one particularly bad year. 307 00:23:13,880 --> 00:23:16,680 But with this distribution, however many years there were 308 00:23:16,680 --> 00:23:19,640 with nought, one, two, three, four Prussian cavalry officers 309 00:23:19,640 --> 00:23:23,880 kicked to death by their horses, beautifully obeyed the Poisson distribution. 310 00:23:42,800 --> 00:23:48,520 So statisticians use shapes to reveal the patterns in the data. 311 00:23:48,520 --> 00:23:51,000 But we also use images of all kinds 312 00:23:51,000 --> 00:23:54,480 to communicate statistics to a wider public. 313 00:23:54,480 --> 00:23:57,320 Because if the story in the numbers 314 00:23:57,320 --> 00:24:02,920 is told by a beautiful and clever image, then everyone understands. 315 00:24:02,920 --> 00:24:09,640 Of the pioneers of statistical graphics, my favourite is Florence Nightingale. 316 00:24:24,280 --> 00:24:27,120 There are not many people who realise that she was known 317 00:24:27,120 --> 00:24:30,520 as a passionate statistician and not just the Lady of the Lamp. 318 00:24:30,520 --> 00:24:34,720 She said that "to understand God's thoughts, we must study statistics, 319 00:24:34,720 --> 00:24:37,080 "for these are the measure of His purpose." 320 00:24:37,080 --> 00:24:40,520 Statistics was for her a religious duty and moral imperative. 321 00:24:42,080 --> 00:24:45,360 When Florence was nine years old she started collecting data. 322 00:24:45,360 --> 00:24:48,320 Her data was different fruits and vegetables she found. 
323 00:24:48,320 --> 00:24:50,080 Put them into different tables. 324 00:24:50,080 --> 00:24:52,640 Trying to organise them in some standard form. 325 00:24:52,640 --> 00:24:55,640 And so we have one of Nightingale's first statistical tables 326 00:24:55,640 --> 00:24:57,440 at the age of nine. 327 00:25:04,360 --> 00:25:11,440 In the mid 1850s Florence Nightingale went to the Crimea to care for British casualties of war. 328 00:25:11,440 --> 00:25:14,400 She was horrified by what she discovered. 329 00:25:14,400 --> 00:25:19,920 For all the soldiers being blown to bits on the battlefield, there were many, many more soldiers 330 00:25:19,920 --> 00:25:25,200 dying from diseases they caught in the army's filthy hospitals. 331 00:25:25,200 --> 00:25:29,120 So Florence Nightingale began counting the dead. 332 00:25:29,120 --> 00:25:34,920 For two years she recorded mortality data in meticulous detail. 333 00:25:34,920 --> 00:25:39,120 When the war was over she persuaded the government to set up 334 00:25:39,120 --> 00:25:41,360 a Royal Commission of Inquiry, 335 00:25:41,360 --> 00:25:44,680 and gathered her data in a devastating report. 336 00:25:44,680 --> 00:25:48,480 What has cemented her place in the statistical history books 337 00:25:48,480 --> 00:25:50,120 are the graphics she used. 338 00:25:50,120 --> 00:25:53,960 And one in particular, the polar area graph. 339 00:25:53,960 --> 00:25:58,680 For each month of the war, a huge blue wedge represented 340 00:25:58,680 --> 00:26:02,200 the soldiers who had died from preventable diseases. 341 00:26:02,200 --> 00:26:05,560 The much smaller red wedges were deaths from wounds, 342 00:26:05,560 --> 00:26:10,600 and the black wedges were deaths from accidents and other causes. 343 00:26:10,600 --> 00:26:17,040 Nightingale's graphics were so clear they were impossible to ignore. 
344 00:26:17,040 --> 00:26:19,360 The usual thing around Florence Nightingale's time 345 00:26:19,360 --> 00:26:23,920 was just to produce tables and tables of figures - absolutely really tedious stuff that, 346 00:26:23,920 --> 00:26:26,320 unless you're an absolutely dedicated statistician, 347 00:26:26,320 --> 00:26:29,240 it's really quite difficult to spot the patterns quite naturally. 348 00:26:29,240 --> 00:26:33,480 But visualisations, they tell a story, they tell a story immediately. 349 00:26:33,480 --> 00:26:38,480 And the use of colour and the use of shape can really tell a powerful story. 350 00:26:38,480 --> 00:26:41,280 And nowadays of course we can make things move as well. 351 00:26:41,280 --> 00:26:44,320 Florence Nightingale would have loved to have played with... 352 00:26:44,320 --> 00:26:48,800 She would have produced wonderful animations, I'm absolutely certain of it. 353 00:26:50,880 --> 00:26:54,800 Today, 150 years on, Nightingale's graphics 354 00:26:54,800 --> 00:26:57,800 are rightly regarded as a classic. 355 00:26:57,800 --> 00:27:00,600 They led to a revolution in nursing, health care 356 00:27:00,600 --> 00:27:05,880 and hygiene in hospitals worldwide, which saved innumerable lives. 357 00:27:07,400 --> 00:27:11,040 And statistical graphics has become an art form of its very own, 358 00:27:11,040 --> 00:27:16,280 led by designers who are passionate about visualising data. 359 00:27:24,640 --> 00:27:27,120 This is the Billion Pound-O-Gram. 360 00:27:27,120 --> 00:27:29,120 This image arose out of frustration 361 00:27:29,120 --> 00:27:32,280 with the reporting of billion pound amounts in the media. 362 00:27:32,280 --> 00:27:34,400 £500 billion for this war. 363 00:27:34,400 --> 00:27:36,000 £50 billion for this oil spill. 364 00:27:36,000 --> 00:27:39,440 It doesn't make sense - the numbers are too enormous to get your mind round. 
365 00:27:39,440 --> 00:27:43,520 So I scraped all this data from various news sources and created this diagram. 366 00:27:43,520 --> 00:27:48,680 So the squares here are scaled according to the billion pound amounts. 367 00:27:48,680 --> 00:27:51,840 When you see numbers visualised like this 368 00:27:51,840 --> 00:27:54,240 you start to have a different relationship with them. 369 00:27:54,240 --> 00:27:56,840 You can start to see the patterns, and the scale of them. 370 00:27:56,840 --> 00:27:59,600 Here in the corner, this little square - £37 billion. 371 00:27:59,600 --> 00:28:02,800 This was the predicted cost of the Iraq war in 2003. 372 00:28:02,800 --> 00:28:06,480 As you can see it's grown exponentially over the last few years 373 00:28:06,480 --> 00:28:10,560 and the total cost now is around about £2,500 billion. 374 00:28:10,560 --> 00:28:13,000 It's funny because when you visualise statistics 375 00:28:13,000 --> 00:28:15,360 you understand them, and when you understand them 376 00:28:15,360 --> 00:28:18,400 you can really start to put things in perspective. 377 00:28:23,960 --> 00:28:27,880 Visualisation is right at the heart of my own work too. 378 00:28:27,880 --> 00:28:30,160 I teach global health. 379 00:28:30,160 --> 00:28:33,840 And I know having the data is not enough - 380 00:28:33,840 --> 00:28:39,160 I have to show it in ways people both enjoy and understand. 381 00:28:39,160 --> 00:28:42,960 Now I'm going to try something I've never done before. 382 00:28:42,960 --> 00:28:45,960 Animating the data in real space, 383 00:28:45,960 --> 00:28:50,480 with a bit of technical assistance from the crew. 384 00:28:50,480 --> 00:28:52,240 So here we go. 385 00:28:52,240 --> 00:28:54,200 First, an axis for health. 386 00:28:54,200 --> 00:28:58,920 Life expectancy from 25 years to 75 years. 387 00:28:58,920 --> 00:29:01,440 And down here an axis for wealth. 388 00:29:01,440 --> 00:29:06,720 Income per person - 400, 4,000, 40,000. 
389 00:29:06,720 --> 00:29:10,480 So down here is poor and sick. 390 00:29:10,480 --> 00:29:14,280 And up here is rich and healthy. 391 00:29:14,280 --> 00:29:18,320 Now I'm going to show you the world 392 00:29:18,320 --> 00:29:21,080 200 years ago, in 1810. 393 00:29:21,080 --> 00:29:22,880 Here come all the countries. 394 00:29:22,880 --> 00:29:26,200 Europe, brown; Asia, red; Middle East, green; 395 00:29:26,200 --> 00:29:29,440 Africa south of the Sahara, blue; and the Americas, yellow. 396 00:29:29,440 --> 00:29:33,760 And the size of the country bubble shows the size of the population. 397 00:29:33,760 --> 00:29:37,560 In 1810, it was pretty crowded down there, wasn't it? 398 00:29:37,560 --> 00:29:39,760 All countries were sick and poor. 399 00:29:39,760 --> 00:29:43,360 Life expectancy was below 40 in all countries. 400 00:29:43,360 --> 00:29:48,680 And only UK and the Netherlands were slightly better off. But not much. 401 00:29:48,680 --> 00:29:52,520 And now I start the world. 402 00:29:52,520 --> 00:29:56,840 The industrial revolution makes countries in Europe and elsewhere 403 00:29:56,840 --> 00:29:59,040 move away from the rest. 404 00:29:59,040 --> 00:30:02,280 But the colonized countries in Asia and Africa, 405 00:30:02,280 --> 00:30:04,040 they are stuck down there. 406 00:30:04,040 --> 00:30:08,200 And eventually the Western countries get healthier and healthier. 407 00:30:08,200 --> 00:30:13,320 And now we slow down to show the impact of the First World War 408 00:30:13,320 --> 00:30:15,880 and the Spanish flu epidemic. 409 00:30:15,880 --> 00:30:18,320 What a catastrophe! 410 00:30:18,320 --> 00:30:22,640 And now I speed up through the 1920s and the 1930s and, 411 00:30:22,640 --> 00:30:24,400 in spite of the Great Depression, 412 00:30:24,400 --> 00:30:27,800 Western countries forge on towards greater wealth and health. 413 00:30:27,800 --> 00:30:29,880 Japan and some others try to follow. 
414 00:30:29,880 --> 00:30:32,560 But most countries stay down here. 415 00:30:32,560 --> 00:30:35,640 And after the tragedies of the Second World War, 416 00:30:35,640 --> 00:30:39,400 we stop a bit to look at the world in 1948. 417 00:30:39,400 --> 00:30:42,080 1948 was a great year. 418 00:30:42,080 --> 00:30:43,280 The war was over, 419 00:30:43,280 --> 00:30:48,000 Sweden topped the medal table at the Winter Olympics and I was born. 420 00:30:48,000 --> 00:30:51,280 But the differences between the countries of the world 421 00:30:51,280 --> 00:30:52,680 was wider than ever. 422 00:30:52,680 --> 00:30:54,960 United States was in the front. 423 00:30:54,960 --> 00:30:56,840 Japan was catching up. 424 00:30:56,840 --> 00:30:58,400 Brazil was way behind, 425 00:30:58,400 --> 00:31:03,040 Iran was getting a little richer from oil but still had short lives. 426 00:31:03,040 --> 00:31:05,160 And the Asian giants... 427 00:31:05,160 --> 00:31:08,720 China, India, Pakistan, Bangladesh, and Indonesia, 428 00:31:08,720 --> 00:31:11,360 they were still poor and sick down here. 429 00:31:11,360 --> 00:31:14,360 But look what was about to happen! Here we go again. 430 00:31:14,360 --> 00:31:18,640 In my lifetime, former colonies gained independence and then finally 431 00:31:18,640 --> 00:31:22,640 they started to get healthier and healthier and healthier. 432 00:31:22,640 --> 00:31:26,080 And in the 1970s, then countries in Asia and Latin America 433 00:31:26,080 --> 00:31:28,960 started to catch up with the Western countries. 434 00:31:28,960 --> 00:31:31,240 They became the emerging economies. 435 00:31:31,240 --> 00:31:32,640 Some in Africa follows, 436 00:31:32,640 --> 00:31:36,440 some Africans were stuck in civil war, and others were hit by HIV. 437 00:31:36,440 --> 00:31:41,840 And now we can see the world in the most up-to-date statistics. 438 00:31:42,840 --> 00:31:45,480 Most people today live in the middle. 
439 00:31:45,480 --> 00:31:48,080 But there is huge difference at the same time 440 00:31:48,080 --> 00:31:51,520 between the best-off countries and the worst-off countries. 441 00:31:51,520 --> 00:31:54,520 And there are also huge inequalities within countries. 442 00:31:54,520 --> 00:31:59,000 These bubbles show country averages but I can split them. 443 00:31:59,000 --> 00:32:02,120 Take China. I can split it into provinces. 444 00:32:02,120 --> 00:32:05,120 There goes Shanghai... 445 00:32:05,120 --> 00:32:08,000 It has the same health and wealth as Italy today. 446 00:32:08,000 --> 00:32:11,240 And there is the poor inland province Guizhou, 447 00:32:11,240 --> 00:32:12,680 it is like Pakistan. 448 00:32:12,680 --> 00:32:18,800 And if I split it further, the rural parts are like Ghana in Africa. 449 00:32:19,800 --> 00:32:23,160 And yet, despite the enormous disparities today, 450 00:32:23,160 --> 00:32:27,240 we have seen 200 years of remarkable progress! 451 00:32:27,240 --> 00:32:31,720 That huge historical gap between the west and the rest is now closing. 452 00:32:31,720 --> 00:32:35,640 We have become an entirely new, converging world. 453 00:32:35,640 --> 00:32:37,960 And I see a clear trend into the future. 454 00:32:37,960 --> 00:32:40,840 With aid, trade, green technology and peace, 455 00:32:40,840 --> 00:32:43,720 it's fully possible that everyone can make it 456 00:32:43,720 --> 00:32:45,640 to the healthy, wealthy corner. 457 00:32:48,000 --> 00:32:51,360 Well, what you've just seen in the last few minutes 458 00:32:51,360 --> 00:32:56,520 is a story of 200 countries shown over 200 years and beyond. 459 00:32:56,520 --> 00:33:00,960 It involved plotting 120,000 numbers. 460 00:33:00,960 --> 00:33:02,560 Pretty neat, huh? 461 00:33:07,960 --> 00:33:13,120 So, with statistics, we can begin to see things as they really are. 
462 00:33:13,120 --> 00:33:18,200 From tables of data to averages, distributions and visualisations, 463 00:33:18,200 --> 00:33:22,640 statistics gives us a clear description of the world. 464 00:33:22,640 --> 00:33:28,200 But, with statistics, we can not only discover WHAT is happening 465 00:33:28,200 --> 00:33:30,520 but also explore WHY, 466 00:33:30,520 --> 00:33:34,480 by using the powerful analytical method - correlation. 467 00:33:35,480 --> 00:33:38,400 Just looking at one thing at a time doesn't tell you very much. 468 00:33:38,400 --> 00:33:41,280 You've got to look at the relationships between things, 469 00:33:41,280 --> 00:33:43,360 how they change, how they vary together. 470 00:33:43,360 --> 00:33:45,360 That's what correlation is about. 471 00:33:45,360 --> 00:33:48,320 That's how you start trying to understand the processes 472 00:33:48,320 --> 00:33:50,960 that are really going on in the world and society. 473 00:33:52,480 --> 00:33:57,000 Most of us today would recognise that crime correlates to poverty, 474 00:33:57,000 --> 00:34:00,200 that infection correlates to poor sanitation, 475 00:34:00,200 --> 00:34:02,600 and that knowledge of statistics correlates 476 00:34:02,600 --> 00:34:05,040 to being great at dancing! 477 00:34:06,560 --> 00:34:10,200 Correlations can be very tricky. 478 00:34:10,200 --> 00:34:12,960 I got a joke about silly correlations. 479 00:34:12,960 --> 00:34:15,840 There was this American who was afraid of heart attack. 480 00:34:15,840 --> 00:34:19,920 He found out that the Japanese ate very little fat 481 00:34:19,920 --> 00:34:22,320 and almost didn't drink wine, 482 00:34:22,320 --> 00:34:25,520 but they had much less heart attacks than the Americans. 483 00:34:25,520 --> 00:34:28,640 But, on the other hand, he also found out that the French 484 00:34:28,640 --> 00:34:35,080 eat as much fat as the Americans and they drink much more wine but they also have less heart attacks. 
485 00:34:35,080 --> 00:34:40,840 So he concluded that what kills you is speaking English. 486 00:34:40,840 --> 00:34:43,920 # Smoke, smoke, smoke that cigarette 487 00:34:43,920 --> 00:34:48,000 # Puff, puff, puff and if you smoke yourself to death... # 488 00:34:48,000 --> 00:34:51,920 The time, the pace, the cigarette. Weights Tilt. 489 00:34:51,920 --> 00:34:56,200 The best example of a really ground-breaking correlation 490 00:34:56,200 --> 00:35:01,640 is the link that was established in the 1950s between smoking and lung cancer. 491 00:35:01,640 --> 00:35:07,040 Not long after the Second World War, a British doctor, Richard Doll, 492 00:35:07,040 --> 00:35:11,040 investigated lung cancer patients in 20 London hospitals. 493 00:35:11,040 --> 00:35:15,400 And he became certain that the only thing they had in common was smoking. 494 00:35:15,400 --> 00:35:18,280 So certain, that he stopped smoking himself. 495 00:35:18,280 --> 00:35:22,160 But other people weren't so sure. 496 00:35:22,160 --> 00:35:25,400 A lot of the discussion of the early data, 497 00:35:25,400 --> 00:35:29,120 linking smoking to lung cancer, said, "It's not the smoking, surely, 498 00:35:29,120 --> 00:35:32,600 "that thing we've done all our lives, that can't be bad for you. 499 00:35:32,600 --> 00:35:35,000 "Maybe it's genes. 500 00:35:35,000 --> 00:35:39,080 "Maybe people who are genetically predisposed to get lung cancer 501 00:35:39,080 --> 00:35:43,840 "are also genetically predisposed to smoke." 502 00:35:43,840 --> 00:35:47,360 "Maybe it's not the smoking, maybe it's air pollution - 503 00:35:47,360 --> 00:35:52,520 "that smokers are somehow more exposed to air pollution than non-smokers. 504 00:35:52,520 --> 00:35:56,280 "Maybe it's not smoking, maybe it's poverty." 505 00:35:56,280 --> 00:36:00,720 So now we've got three alternative explanations, apart from chance. 506 00:36:02,240 --> 00:36:06,760 To verify his correlation did imply cause and effect,
507 00:36:06,760 --> 00:36:10,680 Richard Doll created the biggest statistical study of smoking yet. 508 00:36:10,680 --> 00:36:14,680 He began tracking the lives of 40,000 British doctors, 509 00:36:14,680 --> 00:36:17,000 some of whom smoked and some of whom didn't, 510 00:36:17,000 --> 00:36:19,440 and gathered enough data 511 00:36:19,440 --> 00:36:22,000 to correlate the amount the doctors smoked 512 00:36:22,000 --> 00:36:24,920 with their likelihood of getting cancer. 513 00:36:24,920 --> 00:36:30,120 Eventually, he not only showed a correlation between smoking and lung cancer, 514 00:36:30,120 --> 00:36:35,800 but also a correlation between stopping smoking and reducing the risk. 515 00:36:35,800 --> 00:36:37,760 This was science at its best. 516 00:36:39,760 --> 00:36:44,000 What correlations do not replace is human thought. 517 00:36:44,000 --> 00:36:46,760 You've got to think about what it means. 518 00:36:46,760 --> 00:36:50,480 What a good scientist does, if he comes with a correlation, 519 00:36:50,480 --> 00:36:55,960 is try as hard as she or he possibly can to disprove it, 520 00:36:55,960 --> 00:37:00,200 to break it down, to get rid of it, to try and refute it. 521 00:37:00,200 --> 00:37:05,440 And if it withstands all those efforts at demolishing it 522 00:37:05,440 --> 00:37:10,760 and it is still standing up then, cautiously, you say, "We really might have something here." 523 00:37:26,720 --> 00:37:32,840 However brilliant the scientist, data is still the oxygen of science. 524 00:37:32,840 --> 00:37:39,320 The good news is that the more we have, the more correlations we'll find, the more theories we'll test, 525 00:37:39,320 --> 00:37:42,240 and the more discoveries we're likely to make. 526 00:37:46,160 --> 00:37:53,440 And history shows how our total sum of information grows in huge leaps as we develop new technologies. 527 00:37:53,440 --> 00:38:00,000 The invention of the printing press kicked off the first data and information explosion. 
528 00:38:00,000 --> 00:38:06,000 If you piled up all the books that had been printed by the year 1700, 529 00:38:06,000 --> 00:38:11,200 they would make 60 stacks each as high as Mount Everest. 530 00:38:12,880 --> 00:38:15,360 Then, starting in the 19th century, 531 00:38:15,360 --> 00:38:19,880 there came a second information revolution with the telegraph, 532 00:38:19,880 --> 00:38:23,960 gramophone and camera. And later radio and TV. 533 00:38:23,960 --> 00:38:28,200 The total amount of information exploded. 534 00:38:28,200 --> 00:38:35,200 And by the 1950s the information available to us all had multiplied 6,000 times. 535 00:38:35,200 --> 00:38:41,440 Then, thanks to the computer and later the internet, we went digital. 536 00:38:41,440 --> 00:38:47,200 And the amount of data we have now is unimaginably vast. 537 00:38:49,920 --> 00:38:55,080 A single letter printed in a book is equivalent to a byte of data. 538 00:38:55,080 --> 00:38:58,720 A printed page equals a kilobyte or two. 539 00:39:01,960 --> 00:39:06,240 Five megabytes is enough for the complete works of Shakespeare. 540 00:39:08,000 --> 00:39:11,680 10 gigabytes - that's a DVD movie. 541 00:39:16,840 --> 00:39:23,360 Two terabytes is the tens of millions of photos added to Facebook every day. 542 00:39:24,880 --> 00:39:32,200 Ten petabytes is the data recorded every second by the world's largest particle accelerator. 543 00:39:32,200 --> 00:39:35,800 So much only a tiny fraction is kept. 544 00:39:35,800 --> 00:39:43,440 Six exabytes is what you'd have if you sequenced the genomes of every single person on Earth. 545 00:39:48,680 --> 00:39:50,520 But really, that's nothing. 546 00:39:50,520 --> 00:39:55,080 In 2009, the internet added up to 500 exabytes. 547 00:39:55,080 --> 00:40:02,120 In 2010, in just one year, that will double to more than one zettabyte! 
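The ladder of units in the narration above can be checked with a few lines of arithmetic. This is only a sketch, assuming decimal (SI) units where each step is a factor of 1,000; the `UNITS` list and the `to_bytes` helper are illustrative names, not anything from the programme. Under that assumption, doubling 500 exabytes comes to one zettabyte, in line with the narration's "more than one zettabyte".

```python
# Decimal (SI) byte units: each step up the list is a factor of 1,000 (an assumption).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]


def to_bytes(amount, unit):
    """Convert an amount in the named unit to bytes (1 kilobyte = 1,000 bytes)."""
    return amount * 1000 ** UNITS.index(unit)


# The narration's figures: 500 exabytes in 2009, doubling within a year.
internet_2009 = to_bytes(500, "exabyte")
internet_2010 = 2 * internet_2009

# Express the doubled figure in zettabytes.
print(internet_2010 / to_bytes(1, "zettabyte"))
```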
548 00:40:06,360 --> 00:40:14,000 Back in the real world, if we turned all this data into print it would make 90 stacks of books, 549 00:40:14,000 --> 00:40:18,560 each reaching from here all the way to the sun! 550 00:40:18,560 --> 00:40:23,600 The data deluge is staggering, but, with today's computers 551 00:40:23,600 --> 00:40:28,200 and statistics, I'm confident we can handle it. 552 00:40:28,200 --> 00:40:31,400 When it comes to all the data on the internet, 553 00:40:31,400 --> 00:40:33,760 the powerhouse of statistical analysis 554 00:40:33,760 --> 00:40:37,560 is the Silicon Valley giant Google. 555 00:40:44,000 --> 00:40:50,600 The average person over their lifetime is exposed to about 100 million words of conversation. 556 00:40:50,600 --> 00:40:54,840 And so if you multiply that by the six billion people on the planet, 557 00:40:54,840 --> 00:40:58,040 that amount of words is about equal to the number of words 558 00:40:58,040 --> 00:41:01,080 that Google has available at any one instant in time. 559 00:41:03,480 --> 00:41:08,680 Google's computers hoover up and file away every document, web page, and image they can find. 560 00:41:08,680 --> 00:41:14,640 They then hunt for patterns and correlations in all this data, 561 00:41:14,640 --> 00:41:17,760 doing statistics on a massive scale. 562 00:41:17,760 --> 00:41:25,560 And, for me, Google has one project that's particularly exciting - statistical language translation. 563 00:41:25,560 --> 00:41:30,880 We wanted to provide access to all the web's information, no matter what language you spoke. 564 00:41:30,880 --> 00:41:33,520 There's just so much information on the internet, 565 00:41:33,520 --> 00:41:37,880 you couldn't hope to translate it all by hand into every possible language. 566 00:41:37,880 --> 00:41:41,560 We figured we'd have to be able to do machine translation.
567 00:41:44,280 --> 00:41:47,360 In the past, programmers tried to teach their computers 568 00:41:47,360 --> 00:41:53,320 to see each language as a set of grammatical rules - much like the way languages are taught at school. 569 00:41:53,320 --> 00:41:58,760 But this didn't work because no set of rules could capture a language 570 00:41:58,760 --> 00:42:01,480 in all its subtlety and ambiguity. 571 00:42:01,480 --> 00:42:05,840 "Having eaten our lunch the coach departed." 572 00:42:05,840 --> 00:42:07,920 Well, that's obviously incorrect. 573 00:42:07,920 --> 00:42:12,000 Written like that it would imply that the coach has eaten the lunch. 574 00:42:12,000 --> 00:42:15,160 It would be far better to say... 575 00:42:15,160 --> 00:42:19,920 "having eaten our lunch we departed in the coach." 576 00:42:19,920 --> 00:42:26,320 Those rules are helpful and they are useful most of time, but they don't turn out to be true all the time. 577 00:42:26,320 --> 00:42:30,320 And the insight of using statistical machine translation is saying, 578 00:42:30,320 --> 00:42:35,280 "If you've got to have all these exceptions anyways, maybe you can get by without having any of the rules. 579 00:42:35,280 --> 00:42:39,480 "Maybe you can treat everything as an exception." And that's essentially what we've done. 580 00:42:48,840 --> 00:42:52,640 What the computer is doing when he's learning how to translate 581 00:42:52,640 --> 00:42:55,160 is to learn correlations between words 582 00:42:55,160 --> 00:42:57,240 and correlations between phrases. 583 00:42:57,240 --> 00:43:00,840 So we feed the system very large amounts of data 584 00:43:00,840 --> 00:43:04,720 and then the system is seeing that a certain word or a certain phrase 585 00:43:04,720 --> 00:43:07,600 correlates very often to the other language. 586 00:43:09,800 --> 00:43:15,800 Google's website currently offers translation between any of 57 different languages. 
587 00:43:15,800 --> 00:43:22,680 It does this purely statistically, having correlated a huge collection of multilingual texts. 588 00:43:22,680 --> 00:43:25,600 The people that built the system don't need to know Chinese 589 00:43:25,600 --> 00:43:29,800 in order to build the Chinese-to-English system, or they don't need to know Arabic. 590 00:43:29,800 --> 00:43:33,040 But the expertise that's needed is basically knowledge of statistics, 591 00:43:33,040 --> 00:43:35,840 knowledge of computer science, knowledge of infrastructure 592 00:43:35,840 --> 00:43:40,880 to build those very large computational systems that we are building for doing that. 593 00:43:42,880 --> 00:43:48,360 I hooked up with Google from my office in Stockholm to try the translator for myself. 594 00:43:48,360 --> 00:43:51,760 'I will type... some Swedish sentences.' 595 00:43:51,760 --> 00:43:53,080 OK. 596 00:43:53,080 --> 00:43:55,240 Sveriges... 597 00:43:55,240 --> 00:43:59,280 ..guldring i örat. 598 00:44:00,920 --> 00:44:07,400 OK. So it says, "Sweden's finance minister has a ponytail and a gold ring in your ear." 599 00:44:07,400 --> 00:44:11,520 I guess it probably means in his ear. 'That's exactly correct, it's amazing! 600 00:44:11,520 --> 00:44:15,400 'He comes from the Conservative party, that's the kind of Sweden we have today. 601 00:44:15,400 --> 00:44:18,520 'I will type one more sentence.' 602 00:44:18,520 --> 00:44:22,080 'I sitt samkönade...' 603 00:44:22,080 --> 00:44:25,600 partnerskap... 604 00:44:25,600 --> 00:44:28,280 nya biskop. 605 00:44:28,280 --> 00:44:35,200 "In his same-sex partnership has Stockholm's new bishop and his partners a three-year son." 606 00:44:35,200 --> 00:44:38,120 It's almost perfect, there's one important thing - 607 00:44:38,120 --> 00:44:41,800 it's HER, it's a lesbian partnership.
608 00:44:41,800 --> 00:44:46,760 OK, so those kinds of words his and her are one of the challenges 609 00:44:46,760 --> 00:44:49,080 in translation to get really those right. 610 00:44:49,080 --> 00:44:51,920 Especially when it comes to bishops one can excuse it! 611 00:44:51,920 --> 00:44:53,640 'Right, right.' 612 00:44:53,640 --> 00:44:58,520 I guess more often than not it would probably be a "his". 'I will write one more sentence.' 613 00:44:58,520 --> 00:45:01,720 När Sverige deltar i olympiader är målet 614 00:45:01,720 --> 00:45:03,720 'inte att vinna utan att slå Norge.' 615 00:45:06,400 --> 00:45:11,960 OK. "When Sweden is taking part in Olympic goal is not to win but to beat Norway." 616 00:45:11,960 --> 00:45:13,640 'Yes! This is what it is! 617 00:45:13,640 --> 00:45:17,920 'But they are very good in Winter Olympics, so we can't make it, but we are trying.' 618 00:45:17,920 --> 00:45:19,960 Ah, very good, very good. 619 00:45:19,960 --> 00:45:24,960 'This is absolutely amazing, you know, and I was especially impressed 620 00:45:24,960 --> 00:45:30,520 'that it picks up words like "same-sex partnership" which are very new to the language.' 621 00:45:30,520 --> 00:45:36,920 'The translator is good, but if they succeed with what's next, that'll be remarkable.' 622 00:45:36,920 --> 00:45:38,440 One of the exciting possibilities 623 00:45:38,440 --> 00:45:42,720 is combining the machine translation technology with the speech recognition technology. 624 00:45:42,720 --> 00:45:45,480 Now, both of these are statistical in nature. 625 00:45:45,480 --> 00:45:51,360 The machine translation relies on the statistics of mapping from one language to another, 626 00:45:51,360 --> 00:45:57,840 and similarly speech recognition relies on the statistics of mapping from a sound form to the words.
627 00:45:57,840 --> 00:45:59,520 When we put them together, 628 00:45:59,520 --> 00:46:03,200 now we have the capability of having instant conversation 629 00:46:03,200 --> 00:46:06,760 between two people that don't speak a common language. 630 00:46:06,760 --> 00:46:08,680 I can talk to you in my language, 631 00:46:08,680 --> 00:46:11,880 you hear me in your language and you can answer back. 632 00:46:11,880 --> 00:46:15,000 And in real time we can make that translation, 633 00:46:15,000 --> 00:46:18,800 we can bring two people together and allow them to speak. 634 00:46:31,400 --> 00:46:39,040 The internet is just one of many technologies created to gather massive amounts of data. 635 00:46:39,040 --> 00:46:43,640 Scientists studying our earth and our environment 636 00:46:43,640 --> 00:46:47,440 now use an incredible range of instruments 637 00:46:47,440 --> 00:46:50,920 to measure the processes of our planet. 638 00:46:52,760 --> 00:47:00,360 All around us are sensors continuously measuring temperature, water flow, and ocean currents. 639 00:47:00,360 --> 00:47:06,800 And high in orbit are satellites busy imaging cloud formations, forest growth and snow cover. 640 00:47:06,800 --> 00:47:11,360 Scientists speak of "instrumenting the earth". 641 00:47:13,320 --> 00:47:20,160 And pointing up to the skies above are powerful new telescopes mapping the universe. 642 00:47:30,280 --> 00:47:34,760 What's happening in astronomy is typical of how profoundly 643 00:47:34,760 --> 00:47:39,760 this new torrent of data is transforming science. 644 00:47:39,760 --> 00:47:45,280 Astronomers are now addressing many enduring mysteries of the cosmos 645 00:47:45,280 --> 00:47:49,600 by applying statistical methods to all this new data. 646 00:47:59,800 --> 00:48:03,360 The galaxy is a very big place and it's got billions of stars in it, 647 00:48:03,360 --> 00:48:09,400 and so to put together a coherent picture of the whole galaxy requires having an enormous amount of data. 
648 00:48:09,400 --> 00:48:13,720 And before you could do a large sky survey with sensitive, digital detectors 649 00:48:13,720 --> 00:48:16,880 that meant that you could map many, many stars all at once, 650 00:48:16,880 --> 00:48:20,680 it was very difficult to build up enough data on enough of the galaxy. 651 00:48:24,600 --> 00:48:28,560 In the past, large surveys of the night sky had to be done 652 00:48:28,560 --> 00:48:32,400 by exposing thousands of large photographic plates. 653 00:48:32,400 --> 00:48:37,200 But these surveys could take 25 years or more to complete. 654 00:48:39,040 --> 00:48:44,680 Then, in the 1990s, came digital astronomy and a huge increase 655 00:48:44,680 --> 00:48:49,600 in both the amount and the accessibility of data. 656 00:48:49,600 --> 00:48:55,960 The Sloan Sky Survey is the world's biggest yet, using a massive digital sensor 657 00:48:55,960 --> 00:49:00,840 mounted on the back of a custom-built telescope in New Mexico. 658 00:49:00,840 --> 00:49:05,240 It's scanned the sky night after night for eight years, 659 00:49:05,240 --> 00:49:09,800 building up a composite picture in unprecedented resolution. 660 00:49:09,800 --> 00:49:14,840 The Sloan is some of the best, deepest survey data that we have in astronomy. 661 00:49:14,840 --> 00:49:18,760 Both on our own galaxy and on galaxies further away from ours. 662 00:49:24,080 --> 00:49:27,320 All the Sloan data is on the internet, 663 00:49:27,320 --> 00:49:34,120 and with it astronomers have identified millions of hitherto unknown stars and galaxies. 664 00:49:34,120 --> 00:49:37,480 They also comb the database for statistical patterns 665 00:49:37,480 --> 00:49:42,800 which will prove, disprove, or even suggest new theories. 
666 00:49:42,800 --> 00:49:49,160 So we have this idea that galaxies grow, they become large galaxies like the one we live in, the Milky Way, 667 00:49:49,160 --> 00:49:55,880 not all at once, or not smoothly, but by continuously incorporating, 668 00:49:55,880 --> 00:49:59,160 basically cannibalising, smaller galaxies. 669 00:49:59,160 --> 00:50:04,000 They dissolve them and they become part of the bigger galaxy as it grows. 670 00:50:06,040 --> 00:50:12,520 It's a startling idea, and, in the Sloan data, is the evidence to support it. 671 00:50:12,520 --> 00:50:16,280 Groups of stars that came from cannibalised galaxies 672 00:50:16,280 --> 00:50:21,240 stand out in the Sloan data as statistically different from other stars 673 00:50:21,240 --> 00:50:24,280 because they move at a different velocity. 674 00:50:24,280 --> 00:50:28,680 Each big spike on one of these distribution graphs 675 00:50:28,680 --> 00:50:35,120 means Professor Rockosi has found a group of stars all travelling in a different way to the rest. 676 00:50:35,120 --> 00:50:38,360 They are the telltale patterns she's looking for. 677 00:50:40,240 --> 00:50:44,960 The evidence is accumulating that, in fact, this really is how galaxies grow, 678 00:50:44,960 --> 00:50:47,440 or an important way in which how galaxies grow. 679 00:50:47,440 --> 00:50:53,000 And so this is an important part of understanding how galaxies form, not only ours but every galaxy. 680 00:50:56,360 --> 00:51:00,400 The more data there is, the more discoveries can be made. 681 00:51:00,400 --> 00:51:03,320 And the technology is getting better all the time. 682 00:51:03,320 --> 00:51:07,560 The next big survey telescope starts its work in 2015. 683 00:51:07,560 --> 00:51:10,760 It will leave Sloan in the dust! 684 00:51:10,760 --> 00:51:16,160 Sloan has taken eight years to cover one quarter of the night sky. 685 00:51:17,680 --> 00:51:25,680 The new telescope will scan the entire sky, in even greater resolution, every three days!
686 00:51:34,120 --> 00:51:41,000 The vast amounts of data we have today allows researchers in all sorts of fields 687 00:51:41,000 --> 00:51:46,280 to test their theories on a previously unimaginable scale. 688 00:51:46,280 --> 00:51:53,600 But more than this, it may even change the fundamental way science is done. 689 00:51:53,600 --> 00:51:58,560 With the power of today's computers applied to all this data, 690 00:51:58,560 --> 00:52:03,880 the machines might even be able to guide the researchers. 691 00:52:14,600 --> 00:52:17,920 We're at a potentially profoundly important 692 00:52:17,920 --> 00:52:22,560 and potentially one of the most significant points in science, 693 00:52:22,560 --> 00:52:24,680 and certainly one of the most exciting, 694 00:52:24,680 --> 00:52:32,080 where the potential to transform not just how scientists do science but even what science is possible. 695 00:52:32,080 --> 00:52:34,680 And what will power that transformation 696 00:52:34,680 --> 00:52:38,400 of both how science is done and even what science is possible 697 00:52:38,400 --> 00:52:40,120 is going to be computation. 698 00:52:41,800 --> 00:52:49,440 Many of the dynamics of the natural world, like the interplay between the rainforests and the atmosphere, 699 00:52:49,440 --> 00:52:53,560 are so complex that we don't as yet really understand them. 700 00:52:53,560 --> 00:52:59,280 But now computers are generating literally tens of thousands of different simulations 701 00:52:59,280 --> 00:53:03,480 of how these biological systems might work. 702 00:53:03,480 --> 00:53:07,840 It's like creating thousands of hypothetical parallel worlds. 703 00:53:07,840 --> 00:53:10,640 Each and every one of these simulations 704 00:53:10,640 --> 00:53:18,360 is analysed with statistics to see if any are a good match for what is observed in nature. 
705 00:53:18,360 --> 00:53:21,840 The computers can now automatically generate, 706 00:53:21,840 --> 00:53:26,240 test and discard hypotheses with scarcely a human in sight. 707 00:53:28,240 --> 00:53:35,120 This new application of statistics will become absolutely vital for the future of science. 708 00:53:35,120 --> 00:53:39,400 It's creating a new paradigm, if you like, 709 00:53:39,400 --> 00:53:42,640 in science, in the way in which we can do science, 710 00:53:42,640 --> 00:53:45,280 which is increasingly... 711 00:53:45,280 --> 00:53:51,160 Which one might characterise as... data-centric or data-driven 712 00:53:51,160 --> 00:53:55,000 rather than being hypothesis-driven or experimentally-driven. 713 00:53:55,000 --> 00:53:58,240 So, it's exciting times in terms of the science, 714 00:53:58,240 --> 00:54:02,200 in terms of the computation and in terms of the statistics. 715 00:54:08,800 --> 00:54:15,480 Now, if all that sounds a bit abstract and theoretical to you, how about one final frontier? 716 00:54:15,480 --> 00:54:19,040 Could statistics even make sense of your feelings? 717 00:54:21,200 --> 00:54:25,800 In California - where else? - one computer scientist 718 00:54:25,800 --> 00:54:32,680 is harvesting the internet to try to divine the patterns of our innermost thoughts and emotions. 719 00:54:44,800 --> 00:54:46,360 This is the Madness movement. 720 00:54:46,360 --> 00:54:50,960 The Madness movement represents a skyscraper view of the world. 721 00:54:50,960 --> 00:54:54,880 Each of these brightly coloured dots is an individual feeling 722 00:54:54,880 --> 00:54:58,720 expressed by someone out there in a blog or a tweet. 723 00:54:58,720 --> 00:55:04,480 And when you click on the dot it explodes to reveal the underlying feeling of that person. 724 00:55:04,480 --> 00:55:07,080 This is what people say they're feeling today. 725 00:55:07,720 --> 00:55:10,160 Better...safe... 726 00:55:10,160 --> 00:55:12,040 crappy...
727 00:55:12,040 --> 00:55:14,560 well... 728 00:55:14,560 --> 00:55:18,440 pretty...special... 729 00:55:18,440 --> 00:55:20,800 sorry...alone... 730 00:55:25,560 --> 00:55:29,040 So, every minute, We Feel Fine crawls the world's blogs, 731 00:55:29,040 --> 00:55:34,120 takes all the sentences that start with the words "I feel" or "I am feeling", 732 00:55:34,120 --> 00:55:35,920 and puts them in a database. 733 00:55:35,920 --> 00:55:40,080 We collect all the feelings and we count the most common. 734 00:55:40,080 --> 00:55:43,320 They are better...bad... 735 00:55:43,320 --> 00:55:45,640 good...right... 736 00:55:45,640 --> 00:55:48,520 guilty...sick... 737 00:55:48,520 --> 00:55:51,680 the same...like shit... 738 00:55:51,680 --> 00:55:54,720 sorry...well... 739 00:55:54,720 --> 00:55:56,240 and so on. 740 00:55:58,320 --> 00:56:01,760 And we can take a look at any one feeling and analyse it. 741 00:56:01,760 --> 00:56:04,800 Right now a lot of people are feeling happy. 742 00:56:04,800 --> 00:56:11,320 We can take a look at all the people who are happy and break it down by age, gender or location. 743 00:56:11,320 --> 00:56:16,840 Since bloggers have public profiles we have that information and so we can ask questions like, 744 00:56:16,840 --> 00:56:21,400 "Are women happier than men?" or, "Is England happier than the United States?" 745 00:56:30,240 --> 00:56:33,120 We find that, as people get older, they get happier. 746 00:56:33,120 --> 00:56:40,560 And, moreover, we find that for younger people they associate happiness more with excitement, 747 00:56:40,560 --> 00:56:47,000 and, as people get older, they associate happiness more with peacefulness. 748 00:56:51,240 --> 00:56:57,760 And we also find that women feel loved more often than men, but also more guilty. 749 00:56:57,760 --> 00:57:02,480 While men feel good more often than women, but also more alone. 
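The collection step described above (take every sentence that starts with "I feel" or "I am feeling", store it, and count the most common feelings) can be sketched roughly as follows. This is only an illustration, not the actual We Feel Fine code: the function names, the sample posts, and the shortcut of treating the single word after the phrase as "the feeling" are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: the "feeling" is simply the first word after the opening phrase.
# The real system's matching rules are not described in the transcript.
FEELING_PATTERN = re.compile(r"^I (?:feel|am feeling)\s+([a-z']+)", re.IGNORECASE)


def extract_feelings(text):
    """Return the feeling word from each sentence that starts with
    'I feel' or 'I am feeling'."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    feelings = []
    for sentence in sentences:
        match = FEELING_PATTERN.match(sentence.strip())
        if match:
            feelings.append(match.group(1).lower())
    return feelings


def most_common_feelings(posts, top_n=3):
    """Count feelings across all posts and return the top_n most frequent."""
    counts = Counter()
    for post in posts:
        counts.update(extract_feelings(post))
    return counts.most_common(top_n)


# Hypothetical blog snippets, invented for this example.
SAMPLE_POSTS = [
    "I feel better today. The weather helped.",
    "Long week. I am feeling sorry about the delay!",
    "I feel better now that exams are over.",
    "We went out. I feel alone sometimes.",
]

if __name__ == "__main__":
    print(most_common_feelings(SAMPLE_POSTS))
```

Breaking the counts down by age, gender or location, as described in the transcript, would amount to keeping one such counter per group of authors.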
750 00:57:06,640 --> 00:57:12,480 As people lead more and more of their lives online, they leave behind digital traces, 751 00:57:12,480 --> 00:57:19,840 and with these digital traces we can begin to statistically analyse what it means to be human. 752 00:57:51,280 --> 00:57:54,480 So where does all of this leave us? 753 00:57:54,480 --> 00:58:00,160 We generate unimaginable quantities of data about everything you can think of. 754 00:58:00,160 --> 00:58:02,800 We analyse it to reveal the patterns. 755 00:58:02,800 --> 00:58:10,480 And now not only experts but all of us can understand the stories in the numbers. 756 00:58:18,160 --> 00:58:21,080 Instead of being led astray by prejudice, 757 00:58:21,080 --> 00:58:28,160 with statistics at our fingertips, our eyes can be open for a fact-based view of the world. 758 00:58:28,160 --> 00:58:33,760 So, more than ever before, we can become authors of our own destiny. 759 00:58:33,760 --> 00:58:36,800 And that's pretty exciting, isn't it?! 760 00:58:37,680 --> 00:58:44,200 # 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 761 00:58:44,200 --> 00:58:50,800 # 1, 22, 3, 24, 25, 26, 27, 28, 9, 30, 31, 32, 3, 34, 35, 36, 7 762 00:58:50,800 --> 00:58:54,440 # 38, 39, 40, 41, 42, 3, 44, 45, 46, 47 763 00:58:54,440 --> 00:58:58,680 LYRICS DEGENERATE INTO GIBBERISH 764 00:59:08,680 --> 00:59:13,400 GIBBERISH DEGENERATES INTO NOISE 765 00:59:13,400 --> 00:59:14,440 # 100. #