July 29th, 2021
How a statistical glitch changed how I think about data.
This is Simon, a software engineer at Datawrapper. This week, I will show you how a statistical glitch in the German census changed how I think about data.
Around the year 2003, Germany started to shrink. When we look at the official statistics for this period, we see that the population slowly declined for several years. Then, in 2011, something strange happened: The population suddenly dropped by almost 1.5 million.
Like many European countries, Germany has experienced low birth rates for many years, which is why the country’s population was shrinking. Since 2012, the population has been growing again, mainly due to increased immigration. But what happened in 2011?
What we see here is what I call the census gap. Here’s how it happens: Most countries hold some form of population census at regular intervals, often about once a decade. In between, official statisticians use other statistical methods to project population numbers forward. So when the German statistical offices conducted a population census in 2011, the result was very different from what they had projected in the years before the census.
One reason for the big difference was the long time that had passed since the last census. The last full population census in the western part of Germany had been in 1987 and the last census in what was then the GDR had been in 1981 – a full 30 years before the 2011 one. Also, with the German unification in 1990 and its massive socio-economic impact, the country had undergone major transformations during this time.
To get a better understanding of the census gap, I calculated the difference between the 2011 census results and the 2010 projection for each German district and major city using column formulas and then used a scatter plot to compare the results:
We see that in 2011, most German districts and cities had a much lower population than everyone had assumed before the census. Also, while there were at least some winners in the former West German part of the country, there was a significant population loss in almost all cities and districts of the former East German states.
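For readers who would rather script this comparison than use column formulas, here is a minimal sketch in Python. It only illustrates the arithmetic described above (census result minus projection, as an absolute number and as a percentage of the projection). The `District` class, the `census_gap` function, and all figures are made up for illustration; they are not the real census data or the formulas used for the chart.

```python
from dataclasses import dataclass


@dataclass
class District:
    name: str
    projected_2010: int  # population according to the pre-census projection
    census_2011: int     # population according to the 2011 census


def census_gap(d: District) -> tuple[int, float]:
    """Return the gap (census minus projection) in people and in percent."""
    absolute = d.census_2011 - d.projected_2010
    relative = absolute / d.projected_2010 * 100
    return absolute, relative


# Hypothetical districts with invented numbers, for illustration only.
districts = [
    District("District A", 100_000, 95_000),
    District("District B", 250_000, 252_500),
]

for d in districts:
    absolute, relative = census_gap(d)
    print(f"{d.name}: {absolute:+d} people ({relative:+.1f}%)")
```

A negative value means the census counted fewer people than the projection had assumed. Dividing by the projection rather than the census result is a choice; either works as long as it is applied consistently and stated alongside the chart.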
The lower numbers were a serious problem for many municipalities since much of their federal funding is directly dependent on population size. Berlin alone, a city of 3.3 million, “lost” more than 150,000 people, which is said to amount to a financial loss of almost half a billion euros in federal funding each year. But the biggest losers were the cities of Mannheim and Flensburg, which both saw a population drop of more than 7%.
Because so much time had passed since the last census, the projections before 2011 were largely based on data from population registers and statistical methods, which turned out to be rather imprecise. But the 2011 census also received some criticism. It was Germany’s first attempt to conduct a register-based census, a methodology that some critics considered unreliable. Several hundred cities filed complaints against the census results, although without success. But even today, with the 2021 census around the corner, at least one major case is still pending in court: The city of Flensburg, one of the big losers of the 2011 census, is still fighting to get its old population numbers back.
In the case described here, the population figures prior to the census of 2011 were off. And to make matters worse, the 2011 census results aren’t perfect either. Still, it is the best data we have. Everyone uses it, including policymakers and public administrators.
This story is a reminder that data is never a perfect representation of reality. When you use a particular data set in your work, it is essential that you understand how, why, and by whom the data was collected. This also applies to official statistics, which are usually reliable but can also have very specific issues and limitations. How can you learn about such specifics? First of all, read the footnotes. When a data set comes with footnotes or other metadata, make sure you have a look. Common sense also helps. If the figures don’t add up, or have gaps or weird outliers, you should probably do some research and learn about the background of the data set. To be a responsible data user, always make sure you understand the context and the methodology behind the numbers you use.
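If you want a quick, automated nudge towards “weird outliers” in a time series, a rough sketch like the following can help. It flags years whose year-over-year change is much larger than the typical change in the series. The `flag_jumps` function, the `factor` threshold, and the sample figures are all assumptions made up for this example; they are not an established statistical test, and a flagged year is a prompt to go read the footnotes, not proof of an error.

```python
from statistics import median


def flag_jumps(series: dict[int, int], factor: float = 5.0) -> list[int]:
    """Return the years whose change from the previous year is unusually large.

    A change counts as unusual if its absolute size exceeds `factor` times
    the median absolute year-over-year change in the series.
    """
    years = sorted(series)
    if len(years) < 3:
        return []  # too few data points to say what a typical change is
    changes = {y: series[y] - series[prev] for prev, y in zip(years, years[1:])}
    typical = median(abs(c) for c in changes.values())
    return [y for y, c in changes.items() if abs(c) > factor * typical]


# Invented population figures (in thousands), for illustration only.
population = {2007: 1000, 2008: 998, 2009: 996, 2010: 995, 2011: 980, 2012: 982}

print(flag_jumps(population))
```

With these invented numbers, only the drop between 2010 and 2011 stands out against the small changes in the other years.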
That’s it from me for this week. As always, do let me know if you have feedback, suggestions, or questions. I am looking forward to hearing from you at email@example.com, Mastodon, or Twitter. We’ll see you next week!
The number of German citizens (people with a German passport) is still going down due to demographic change, but the total population of Germany has increased since the early 2010s, mostly due to migration. For more information on migration and demographic change in Germany and Europe, see this brief explanation by Frank Swiaczny.
I am not an expert in official statistics, only a nerd who happens to be interested in open data. If you know a better term for the kind of statistical glitch described here, let me know.
The exact technical term used by German statisticians for projected population numbers is ‘Bevölkerungsfortschreibung’. I downloaded the projected numbers and the 2011 census results from Datenguide, an experimental data portal for German statistics. Full disclosure: I am a co-founder and maintainer of the Datenguide project.
The 2011 census was mostly based on data from several administrative sources, such as population registers and the federal employment agency. In a second step, a sample of the German population was surveyed to verify the data. Most criticisms of the census concern the size of the population sample, which was perceived by some as too small. Björn Schwentker wrote a more detailed explanation of the issue for Der Spiegel.