World Forename & Surname Distribution Maps

About Forebears Names

Forebears Names is a free service providing access to the largest geospatial database of forename and surname distribution and demographics. It provides the approximate incidence of forenames and surnames, produced from a database of 4,044,546,938 people (55.5% of living people in 2014). As of September 2019 it covers 27,662,801 forenames and 27,206,821 surnames in 236 jurisdictions. The geospatial data can be viewed on an interactive map and in table form. Statistics can be viewed at a global, continental, georegional, national and (multi-)regional scope.

Brief History

Forebears Names was introduced to Forebears in June 2012, with the launch of the website. At that time the scope was limited to England, Scotland, Wales and The Channel Islands; covering around 425,000 surnames listed in the 1881 census of The United Kingdom. In April 2013 over 64,000 surnames were added with the addition of data from the Ireland census of 1901. This initial version was presented using HighMaps.

From April 2013 Forebears began the long process of compiling data for the first global surname mapping facility. At this time the most expansive source was Public Profiler's Worldnames project, which covers twenty-six countries with a sample of 300 million people. Owing to the number of data sources, differences in formatting and writing scripts, and a whole host of other problems, the facility was not launched until September 2014. The data was derived from a sample of 1,587,475,724 people, covering 227 sovereign states, dependencies and territories. An update occurred a few months later, fixing a number of issues and adding a small amount of new data.

Immediately work commenced on an expansive update, including more jurisdictions, more depth and a larger sample. The update was initially projected for February 2016, but tasks always took longer than anticipated and new tasks regularly presented themselves. As there was no standard format to the data sources being used, extracting them and arranging them in a universal manner took up to two months in the case of one country.

Once the data was compiled for the update it took a further six months to correctly assign individuals to identifiable administrative divisions. This was owing to the source data coming from hundreds of individual sources that did not use a universal way of denoting location.

A further two months were required to re-build the geospatial statistics generation script, a new website and mapping interface. The second version of global surname mapping was released on the 5th of September 2018. This build was from a sample of 3,936,342,242 people, covering over 26,211,602 surnames and 236 jurisdictions. This update moved the visual presentation of data to Leaflet and added first level administrative divisions for many countries, which were later bolstered with second level administrative divisions.

Work on another major update commenced in December 2018, focusing on the addition of forename data. It took longer than expected owing to the use of disparate data sources that took time to fashion into a single format. This is the current version of the project, produced from a global sample of 4,044,546,938 people and covering 27,662,801 forenames and 27,206,821 surnames in 236 jurisdictions. It was released in late August 2019.

In early 2019 Forebears began a process of adding demographic data for surnames, such as the distribution of religious faith and average income. This was added for forenames in August 2019.

Going forward the emphasis of the project is to increase the data sample, develop and expand an API that predicts demographic factors from a name input, add more demographic data and, to a lesser extent, add historic data.

Forebears Names data has been used by publicly traded companies, banks, national security contractors, marketers, The Federal Reserve and has been cited in over 60 academic studies.

Process

The creation of geospatial data for names has three stages: the extraction of data from sources and conversion to a universal database format; the sanitisation of the resultant databases and referencing of people to geographic regions; and the compilation of the geospatial data itself.

1) Database Creation

The first stage in the process of producing geospatial data is the importation of each data source (of which there are over 350) to an individual database table in a universal format. The basic format is: forename, middle name and surname. The source data has come in many formats. Some were easy to import, such as CSVs, Excel spreadsheets, database dumps and standalone databases, while others have been problematic, specifically PDFs, of which around 40-50 million pages have been parsed. The character encoding of each source is checked and, if need be, converted to UTF-8, which is the encoding used for all data.
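The encoding conversion can be sketched in Python as follows. This is a minimal illustration rather than Forebears' actual tooling: the function name to_utf8 is hypothetical, and the sketch assumes the source encoding has already been identified.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8.

    Assumption: the caller already knows the source encoding.
    Detecting it is outside the scope of this sketch.
    Raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    text = raw.decode(source_encoding)  # bytes -> Unicode text
    return text.encode("utf-8")         # Unicode text -> UTF-8 bytes
```

Decoding first and re-encoding second means corrupt input fails loudly instead of being stored with damaged diacritics.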

2) Sanitisation and Geospatial Referencing

Once a source has been imported to a database table, various facets of its data are sanitised and assessed for their integrity. Specifically the name is sanitised to remove any character other than Latin alphabetic characters, hyphens (-), spaces ( ) and apostrophes ('). Various changes are made to fix common errors, such as the name McDonald appearing as “Mc Donald”; names beginning with “Dr”, “Mr”, “Mrs” etc. and names beginning with hyphens.
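A minimal Python sketch of these sanitisation rules is given below. It is an illustration under stated assumptions, not the code Forebears runs: the names TITLES, is_allowed and sanitise_name are hypothetical, the title list contains only the three examples quoted above, and only the straight apostrophe (') is kept.

```python
import re
import unicodedata

# Assumption: only the titles quoted in the text; a real list would be longer.
TITLES = {"dr", "mr", "mrs"}

def is_allowed(ch: str) -> bool:
    """Keep Latin letters (with or without diacritics), hyphens, spaces and apostrophes."""
    if ch in "-' ":
        return True
    return ch.isalpha() and "LATIN" in unicodedata.name(ch, "")

def sanitise_name(name: str) -> str:
    # Treat any whitespace (tabs, newlines) as a single space.
    name = re.sub(r"\s+", " ", name)
    # Drop every character that is not permitted in a name.
    cleaned = "".join(ch for ch in name if is_allowed(ch))
    # Remove leading titles such as "Dr" or "Mrs".
    words = cleaned.split()
    while words and words[0].lower() in TITLES:
        words.pop(0)
    # Remove leading hyphens left over from the source data.
    cleaned = " ".join(words).lstrip("- ")
    # Rejoin a detached "Mc" prefix: "Mc Donald" -> "McDonald".
    return re.sub(r"\bMc ([A-Z])", r"Mc\1", cleaned)
```

For example, sanitise_name("Dr. Mc Donald") drops the full stop and the title and rejoins the prefix, while names such as Öztürk or Jones-Williams pass through unchanged.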

In some cases source data has only included a single name string and not a specifically defined forename and surname. In these cases the name parts are extracted, taking into account particles, such as “de la”, “bin” and “van”.
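The particle-aware split can be sketched in Python as below. This is a simplified illustration: PARTICLES holds only the examples named above, split_full_name is a hypothetical helper, and the sketch assumes given names come before the surname. A single-word name is returned as a surname, which is a choice made for this sketch.

```python
from typing import Tuple

# Assumption: only the particles quoted in the text; a real list would be longer.
PARTICLES = {"de", "la", "bin", "van"}

def split_full_name(full_name: str) -> Tuple[str, str]:
    """Split a single name string into (given names, surname).

    The surname starts at the last word and is extended leftwards
    over any particles directly before it. The first word is never
    absorbed into the surname.
    """
    words = full_name.split()
    if len(words) < 2:
        return ("", " ".join(words))
    start = len(words) - 1
    while start > 1 and words[start - 1].lower() in PARTICLES:
        start -= 1
    return (" ".join(words[:start]), " ".join(words[start:]))
```

With this rule "Maria de la Cruz" yields the surname "de la Cruz", while "La Toya Jackson" keeps "La" with the given names because the first word is never moved.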

Forenames are assessed to ascertain if they contain more than one name and any extra components are assigned to a middle name. They may only contain one name, including particles in names like “Abd Rahaman” and “La Toya”. One current exception is when the forename was derived from a writing script other than Latin and it was determined the forename should have a space. This will be changed in a future update.

Forenames are always derived from the first part of a given name. In some cases the forename is an initial, in which case the initial is assigned to the middle name and the forename is blank.
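The forename and middle name rules above can be sketched in Python as follows. The helper name split_given_names and the set FORENAME_PARTICLES are hypothetical, and the particle set contains only the two examples from the text (“Abd” and “La”).

```python
from typing import Tuple

# Assumption: only the particles from the examples "Abd Rahaman" and "La Toya".
FORENAME_PARTICLES = {"abd", "la"}

def split_given_names(given: str) -> Tuple[str, str]:
    """Return (forename, middle name) from a string of given names."""
    words = given.split()
    if not words:
        return ("", "")
    # The forename is the first word, plus the next word when the first is a particle.
    take = 2 if words[0].lower() in FORENAME_PARTICLES and len(words) > 1 else 1
    forename = " ".join(words[:take])
    middle = " ".join(words[take:])
    # An initial is not a usable forename: move it to the middle name
    # and leave the forename blank.
    if len(forename.rstrip(".")) == 1:
        middle = " ".join(part for part in (forename, middle) if part)
        forename = ""
    return (forename, middle)
```

Note that when the first part is an initial the sketch does not promote the next name to forename, in line with the rule that forenames are always derived from the first part of a given name.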

Surnames from the Spanish and Portuguese traditions, where individuals usually have a surname from their mother and father, are stored separately so far as Forebears has been able to discern from the source data.

Multiple sources were obtained in a corrupted format, specifically diacritic marks (or accents). In these cases the data was recovered with reference to other sources for the country in question.

Some sources encoded names in ASCII, without the diacritic marks they should carry. In such cases the diacritics have been restored.

A limited number of sources had a significant minority of names back-to-front (surname as forename and vice versa). In these cases Forebears has correctly arranged the names as much as possible. Reversed names also occur occasionally in any database as a human data entry error.

Some sources were in writing scripts other than Latin. This presents an issue in that it is not known how each individual may convert their name into the Latin alphabet. This process of conversion is known as transliteration or Romanisation. Further the majority will not have a Latin rendering of their name. The solution Forebears has used, as much as possible, is to use the most prevalent trends in transliteration to systematically convert all names in a given writing script to Latin.

Forebears uses the following methods for transliteration:

Arabic (Hassaniya): conversion tables, Government of Mauritania
Arabic (standard): Forebears proprietary
Armenian: ICU modified
Azerbaijani: 'ə$' => 'e', 'Ə' => 'A', 'ə' => 'a'
Bengali: conversion tables, Government of West Bengal
Bulgarian: Forebears proprietary
Burmese: Forebears proprietary
Chinese: ICU
Dhivehi: conversion tables, Government of The Maldives
Farsi: Forebears proprietary
Georgian: conversion tables, Government of Georgia
Greek: Forebears proprietary
Gujarati: conversion tables, Government of Gujarat
Hebrew: Forebears proprietary
Hindi: conversion tables, Governments of Uttar Pradesh and Rajasthan
Japanese: jTalk
Kannada: ICU modified
Khmer: Forebears proprietary
Korean: ICU
Macedonian: Forebears proprietary
Marathi: conversion tables, Government of Maharashtra
Mongolian: Forebears proprietary
Nepali: ICU modified
Oriya: conversion tables, Government of Odisha
Russian: Forebears proprietary
Serbian: Forebears proprietary
Thai: RTGS
Tibetan: conversion tables, Government of Bhutan
Ukrainian: Forebears proprietary
Urdu: Forebears proprietary
Uzbek: Forebears proprietary
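As an illustration of how a rule-based entry in the list above works, the three Azerbaijani rules can be applied in order with regular expressions. This Python sketch assumes the rules are applied to the whole name string in the order listed; the function name transliterate_azerbaijani is hypothetical.

```python
import re

# The three rules from the list above, applied in the order given.
# The end-of-string rule must run before the general lowercase rule.
AZERBAIJANI_RULES = [
    (r"ə$", "e"),  # a final schwa becomes "e"
    (r"Ə", "A"),   # capital schwa becomes "A"
    (r"ə", "a"),   # any remaining schwa becomes "a"
]

def transliterate_azerbaijani(name: str) -> str:
    for pattern, replacement in AZERBAIJANI_RULES:
        name = re.sub(pattern, replacement, name)
    return name
```

Other Azerbaijani letters such as ö, ü and ş are left untouched, since they are already Latin characters with diacritics.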

The gender of individuals is sanitised to only include male, female and, in a minority of cases, other. X and Y are used to denote the gender of individuals who appear with no forename or whose forename is an initial. This is to maintain the sex ratio of usable forenames for producing statistics.

Dates of birth are checked to be valid and within a reasonable time period (i.e. not born in 1500).
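A date-of-birth check of this kind can be sketched in Python as below. The text gives no exact cutoff for a "reasonable time period", so the 120-year limit and the names MAX_AGE_YEARS and is_plausible_birth_date are assumptions made for this example.

```python
from datetime import date
from typing import Optional

# Assumption: the text does not state a cutoff; 120 years is illustrative.
MAX_AGE_YEARS = 120

def is_plausible_birth_date(year: int, month: int, day: int,
                            today: Optional[date] = None) -> bool:
    """Return True if the date exists and falls within a plausible lifetime."""
    today = today or date.today()
    try:
        born = date(year, month, day)
    except ValueError:
        # The calendar date does not exist, e.g. 30 February.
        return False
    # Reject births in the future or implausibly far in the past.
    return today.year - MAX_AGE_YEARS <= born.year and born <= today
```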

These are the basic functions that are regularly performed on data. Many sources required specific attention, such as extracting names from elaborate strings including patronymic and matrimonial references. In Hungary, for example, many women appeared with their husband's forename and a suffix denoting “wife of”.

In countries where many people do not have surnames (Indonesia, Myanmar) the part of the name that would be used to create a surname from in a Western context has been considered as a surname.

As of September 2019, 145 jurisdictions appear with at least one level of administrative divisions within them. For example, within The United States name distribution statistics can be viewed at a state and county/independent city level. Assigning individuals to administrative divisions was often simple, as many sources delineated individuals as such. Others had to be inferred from postal code and/or city, which was not always a simple task due to changes in postal codes, administrative boundaries and a variety of other issues. Administrative divisions are assigned from GeoPostcodes' global postal code database.

Once administrative divisions are assigned, the resultant taxonomy is verified against GeoPostcodes' data and other sources to ensure there are no omissions, duplications or erroneous additions.

The incidence of a name in a jurisdiction's administrative division may be lower than in the jurisdiction owing to some individuals not being assigned to a division.

A small minority of administrative divisions are missing from the source data and appear as empty.

The percentage of each administrative division's population that is represented can vary.

Forebears have assigned administrative divisions based on those used at the time individuals were referenced to administrative divisions. Administrative divisions will not be updated to account for future changes.

A small number of individuals are not assigned to a place owing to insufficient or ambiguous geographic references. A number of people who could be assigned to administrative divisions, but only by being individually catalogued, have not been assigned to divisions, as doing so would have been an extremely inefficient use of resources and would have considerably delayed the project.

3) Compilation

This process is followed for each jurisdiction.

1) Firstly individuals are grouped by diacritic-sensitive name by their lowest level administrative division (so in The United States, counties) or by no division if they are not assigned to one. When compiling for forenames gender is added to the grouping. Only the lowest administrative division for each individual is used because some jurisdictions do not have a universal structure for delineating divisions.

1a) Due to an unequal gender ratio in data for Macedonia, Tajikistan, Turkmenistan and Uzbekistan surnames are adjusted to the sex ratio of the country.

1b) A small number of jurisdictions with a small sample have the names of immigrants moved to another table, so their incidence is not scaled up to find the approximate number of people with that name.

1c) Due to an over-representation of guest workers in the following countries: Bahrain, Hong Kong, Kuwait, Macau, Oman, Qatar, Saudi Arabia, Singapore and Taiwan, the incidence of names is modified to be in line with the representation of various ethnicities.

1d) Western forenames of Chinese people in China, Hong Kong, Macau, Singapore and Taiwan are ignored, e.g. Toby Ng.

2) The built data is re-combined to be case-insensitive.

3) When building forenames empty forenames are removed prior to building statistics.

4) When building forename statistics each administrative division (or the entire jurisdiction if no division) is assessed against the sex ratio of the country and the incidence of forenames is adjusted to bring it in line with the jurisdiction's sex ratio if need be.

5) When building forename statistics names are merged to combine incidence for the same name with different genders, e.g. males and females with the name Alex.

6) The population of each administrative division within the current jurisdiction is called from a database table. This is used to find the multiplier by which the sample for each administrative division needs to be adjusted. Those within divisions are adjusted, while any not within a division are left as is.

7) Any names ignored in step 1b) are reintroduced.

8) When building surname statistics any blank surnames are deleted. They are deleted at this stage as there are a number of countries where many people have no surname, such as India.

9) If the current jurisdiction has administrative divisions the built statistics are now used to create the higher level administrative divisions (including the jurisdiction) from the lowest division.

10) With statistics built, the percentage share of all names and the rank of each name are calculated for each administrative division and the jurisdiction. Rankings use the standard competition method, in which tied names share a rank. The name that occurs most in the area is ranked first. Names are then ranked in descending order of their incidence with an increment of one. When two or more names occur the same number of times, they share the same rank. The next rank is then the number of preceding names plus one.

Surname	Incidence	Rank
Wang	100	1
Li	90	2
Chong	90	2
Chen	80	4
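The ranking rule illustrated in the table above, where tied names share a rank and the following rank skips ahead, can be sketched in Python. The function name rank_names is hypothetical, and the percentage column is included only to show how share and rank can be derived in one pass.

```python
from typing import Dict, List, Tuple

def rank_names(incidence: Dict[str, int]) -> List[Tuple[str, int, int, float]]:
    """Return (name, incidence, rank, percent of all names), most common first."""
    total = sum(incidence.values())
    # sorted() is stable, so tied names keep their input order.
    ordered = sorted(incidence.items(), key=lambda item: -item[1])
    rows = []
    rank = 0
    previous_count = None
    for position, (name, count) in enumerate(ordered, start=1):
        # A new (lower) incidence takes its position as rank;
        # a tie reuses the rank of the previous name.
        if count != previous_count:
            rank = position
            previous_count = count
        rows.append((name, count, rank, 100.0 * count / total))
    return rows
```

Given the four surnames in the table, the function assigns ranks 1, 2, 2 and 4.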

With each jurisdiction built, they are compiled to produce the incidence, percentage of all names and rank at a global, continental, georegional (e.g. Western Europe) and onoregional* level. Finally each name is assessed to determine the country in which it has the highest incidence and is most numerous compared to other names.

*Onoregions are regions delineated by Forebears denoting areas within georegions that share similar naming traditions.
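Steps 6 and 9 of the compilation (scaling each division's sample to its population, then building the jurisdiction from its lowest divisions) can be sketched in Python as below. This is a simplified two-level illustration with hypothetical names; it assumes the multiplier is the division's population divided by the number of sampled people assigned to it, and that intermediate levels would be summed in the same way.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

def scale_and_roll_up(
    sample_counts: Dict[Tuple[Optional[str], str], int],
    division_population: Dict[str, int],
):
    """Estimate incidence per division and for the whole jurisdiction.

    sample_counts maps (division, name) to the count in the sample;
    a division of None means the people were not assigned to one.
    """
    # Size of the sample within each division.
    sample_size = defaultdict(int)
    for (division, _name), count in sample_counts.items():
        if division is not None:
            sample_size[division] += count

    per_division = defaultdict(dict)
    jurisdiction = defaultdict(float)
    for (division, name), count in sample_counts.items():
        if division is None:
            estimate = float(count)  # unassigned people are left as is
        else:
            multiplier = division_population[division] / sample_size[division]
            estimate = count * multiplier
            per_division[division][name] = estimate
        # The jurisdiction total is built from the lowest level upwards.
        jurisdiction[name] += estimate
    return dict(per_division), dict(jurisdiction)
```

Because unassigned individuals count towards the jurisdiction but towards no division, the divisions of a jurisdiction can sum to less than the jurisdiction's own figure.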

Limitations

1) The primary limitation of Forebears Names is the inability to obtain data on all living individuals. Approximations derived from a small percentage of a population miss many names and can produce moderate inaccuracies in rankings in balanced samples, and much larger inaccuracies in imbalanced samples. Forebears seeks to address this by continually seeking new source data. However, Forebears is the largest geospatial names database, produced from a sample eight times larger than the nearest comparable service. As such Forebears provides the most comprehensive data for most jurisdictions, and the only data in many cases. There is also very little publicly available data on the distribution of forenames. To Forebears' knowledge there are three country-level services built from a larger sample than Forebears, which are listed in Appendix III.

2) Beyond censuses, which are typically conducted every ten years and not made publicly available, the currency of sources varies. The most commonly used source is voter lists, which cover most or a large portion of a country's adult population. However these include deceased people, sometimes in small but notable quantities. They also may not be updated after someone moves.

3) Some sources may be biased towards certain ethnic groups or those with higher incomes, which are not distributed equally by name.

4) Source data may contain human data-input errors and some data is self-reported, which may include names like “fghfghfghf” or “Jones Brothers Ltd”. Where these have been identified they have been removed.

5) Due to human data-input errors names occasionally occur back to front, e.g. Smith as a forename and John as a surname. In some databases from developing countries this was more common and in those cases Forebears attempted to rectify the issue as much as is possible.

6) Some sources are biased towards certain age groups, either the young (5-18), adults (18+) or those more economically active (25/30-60/70). This will most notably cause inaccuracies in forename distribution, as trends in naming babies can move dramatically over a generation. It will also affect accuracies in surname distribution where immigration is a factor.

Appendix I: Sample Sizes

Below is a table showing the percentage of each jurisdiction's population that appears in Forebears Names' source data.

Country	Sample Size (%)
Georgia	100
Spain	100
Israel	100
Armenia	100
Czech Republic	100
United States	100
Pitcairn Islands	100
Ukraine	99.6281
Taiwan	99.6114
Abkhazia	98.9871
Sweden	97.2559
Bulgaria	96.517
Kosovo	96.4718
China	94.6195
Saint Lucia	94.2909
Norway	93.5677
South Korea	92.9869
Trinidad and Tobago	91.5853
Slovenia	89.6028
Finland	88.354
Chile	88.0213
Indonesia	87.6476
Anguilla	86.1433
Saint Vincent and The Grenadines	85.9841
Poland	83.9834
Marshall Islands	79.4265
Lesotho	78.8843
Philippines	76.5383
Cook Islands	73.0529
Peru	72.0211
Costa Rica	71.8452
Monaco	71.4844
Scotland	70.9169
Croatia	69.1766
United States Virgin Islands	68.9192
Mexico	68.8096
Turkey	66.8737
Grenada	66.7275
Norfolk Island	66.4639
Maldives	65.8779
Guyana	65.5974
Nauru	65.2068
Lebanon	65.2032
Saint Kitts and Nevis	65.1228
Argentina	64.9276
Venezuela	64.7
Iceland	63.928
Puerto Rico	63.6864
Saint Pierre and Miquelon	62.4733
El Salvador	62.4351
India	62.3176
Nicaragua	61.8202
Jersey	60.9373
Panama	60.911
Slovakia	60.4236
British Virgin Islands	58.8361
England	57.8076
Azerbaijan	57.2106
Canada	56.9876
Denmark	56.8946
Cayman Islands	56.6993
Wales	55.6053
Papua New Guinea	54.6311
Australia	54.0804
Honduras	52.9872
Cape Verde	52.9174
Sao Tome and Principe	51.9546
Uruguay	51.3979
Cambodia	51.1247
Kyrgyzstan	50.9016
Bhutan	50.4298
Russia	49.0728
Liechtenstein	48.1295
Montserrat	48.0699
Brazil	47.4398
Saint Helena Ascension and Tristan Da Cunha	47.1741
Switzerland	46.5344
Bermuda	45.4084
Macedonia	45.1393
Belarus	44.9081
Belize	44.7883
Montenegro	44.6058
Netherlands	44.5946
Moldova	44.5832
Pakistan	44.4554
San Marino	44.3631
Benin	44.3528
Nepal	44.0495
Isle of Man	43.9562
Paraguay	43.45
New Zealand	42.6687
Vietnam	42.1542
American Samoa	42.0328
South Ossetia	41.1361
South Africa	40.8694
Palestine	40.6801
Turks and Caicos Islands	40.4197
Niger	39.9727
Belgium	39.9002
Northern Ireland	39.816
Colombia	39.6764
Niue	39.4296
Senegal	38.9956
Liberia	38.6844
Nigeria	38.4484
Uganda	37.9698
Gibraltar	37.6763
Northern Mariana Islands	37.0039
Germany	36.8024
Antigua and Barbuda	36.608
Latvia	36.2917
Barbados	35.2443
Mauritania	32.4168
Austria	32.3281
Ireland	32.3002
Jordan	32.077
Solomon Islands	32.0136
Ecuador	31.5766
Zimbabwe	31.4489
Andorra	31.3692
Luxembourg	31.2874
Greenland	30.8668
Iran	30.8147
Jamaica	30.6644
Transnistria	30.6377
Ivory Coast	30.1093
France	29.7655
Falkland Islands	29.2581
Cyprus	28.9219
Estonia	28.3213
Hungary	27.4044
Guam	26.9932
United Arab Emirates	26.7623
Bosnia and Herzegovina	26.6208
Serbia	26.3722
Aruba	25.9586
Burkina Faso	25.8066
Cameroon	25.7179
New Caledonia	24.4795
Greece	23.8838
Yemen	23.4968
Italy	23.2548
Bahamas	22.074
Kazakhstan	21.8617
French Polynesia	20.754
Singapore	20.0904
Dominica	19.9777
Guernsey	19.904
Botswana	19.5054
Iraq	19.2098
Malta	18.9377
Oman	18.8746
Lithuania	18.8726
Zambia	18.6179
Mauritius	17.981
Saint Barthelemy	17.5029
Suriname	16.9967
Mongolia	16.4875
Namibia	15.7669
Romania	15.3673
DRCongo	15.2121
Malaysia	15.0674
Algeria	14.2567
Dominican Republic	13.9566
Japan	13.8864
Brunei	13.6021
Faroe Islands	12.8317
Seychelles	12.1355
Micronesia	11.4133
Portugal	11.1799
Kuwait	11.1786
Thailand	11.0061
Qatar	10.0262
Kenya	9.7568
Wallis and Futuna	9.5775
Albania	9.2685
Hong Kong	8.439
Haiti	8.2689
Tuvalu	7.9927
Palau	7.6766
Bahrain	7.6359
Guatemala	7.4689
Tanzania	7.0847
Saint Martin	6.9815
Cuba	6.0307
Bolivia	5.8926
Kiribati	5.555
Vanuatu	5.4279
Tonga	5.0643
Tunisia	5.0363
Syria	4.7243
Fiji	4.7199
Djibouti	4.5811
Swaziland	4.393
Malawi	4.2594
Samoa	4.0158
Macau	3.9076
Morocco	3.5919
Gabon	3.4761
Uzbekistan	2.8366
Saudi Arabia	2.7967
Somalia	2.2936
Afghanistan	2.2096
Tajikistan	2.0175
Ghana	1.7606
Northern Cyprus	1.6708
Sri Lanka	1.6422
Mali	1.6245
Turkmenistan	1.5976
Togo	1.3964
Bangladesh	1.1586
Gambia	0.9721
Egypt	0.9354
Equatorial Guinea	0.8192
Libya	0.763
Ethiopia	0.7492
Angola	0.7433
Rwanda	0.5641
Sudan	0.5336
Comoros	0.4863
East Timor	0.4733
Madagascar	0.4584
Congo	0.4549
Myanmar	0.4233
Guinea	0.3532
South Sudan	0.3144
Mozambique	0.2871
Sierra Leone	0.2744
Burundi	0.2134
Laos	0.2033
Guinea Bissau	0.1708
Chad	0.1692
Central African Republic	0.1564
Eritrea	0.1473
North Korea	0.0095

Appendix II: Sources

Owing to its proprietary nature, Forebears does not cite the sources used unless required to by law or the data is historic.

Partial list of sources:

Hagstova Føroya. (2015). Boys names 2001-2014. Retrieved from URL
Hagstova Føroya. (2015). Female names 2001-2014. Retrieved from URL
Hagstova Føroya. (2015). Surnames 2001-2014. Retrieved from URL
Instituto Nacional de Estadistica. (2018). Frecuencias de apellidos. Retrieved from URL
Instituto Nacional de Estadistica. (2018). Frecuencias de nombres. Retrieved from URL
Ministerstvo Vnitra České Republiky. (2016). Četnost jmen a příjmení. Retrieved from URL
Statistični urad Republike Slovenije. (2018). Imena dečkov, Slovenija, letno.
Statistični urad Republike Slovenije. (2018). Imena deklic, Slovenija, letno.
Statistični urad Republike Slovenije. (2018). Priimki, Slovenija, letno.
The Church of Jesus Christ of Latter Day Saints. (2001). 1880 United States Census.
The Church of Jesus Christ of Latter Day Saints. (2001). 1881 British Census.
The National Archives of Ireland. (2007). Ireland 1901 Census.
전자가족관계등록시스템. (2019). 아기 이름 빈도.
통계청. (2015). 성, 가족 기원 및 종교 관련 항목 조사

Appendix III: Services With a Larger Sample

The following services provide surname distribution statistics with a larger sample than Forebears.

Belgium: Familienaam provides distribution data from Belgium's 1998 and 2008 population registers.

France: Filae.com provides distribution data based on birth data from l'Institut National de la Statistique et des Études Économiques. This data is based on births and not where people were living in a given year.

Netherlands: The Meertens Instituut provides distribution data on all people with Dutch nationality who lived in The Netherlands in 2007.

Names - Distribution

Forebears provides data on name distribution and is not involved in the delineation of nations. The website does not specifically define any of the following disputed territories as nations or part of nations:

Abkhazia, Artsakh, Ceuta and Melilla, Crimea, Golan Heights, Israel, Kashmir, Kosovo, Northern Cyprus, Palestine, Sahrawi Republic, Somaliland, South Ossetia, Taiwan and Transnistria.

They all appear as they are controlled de facto. This decision was taken as the records Forebears uses are obtained from sources within the de facto jurisdiction; the de jure or other claimants have few or no records of the names of those who live in them. The administrative divisions within disputed territories also often differ from the de jure administrative divisions. This is current as of April 2014.

The delineations have not been made for political reasons and Forebears does not comment on boundary disputes.

The Donetsk People's Republic and The Luhansk People's Republic are not included as they were not established at the time of creation, and the situation in this area is unclear. The Autonomous Administration of North and East Syria is not included as no data has been obtained covering its boundaries. It appears as part of Syria.

Forebears uses a database of around four billion people to produce statistics relating to forenames and surnames. Thus they are approximations.

Due to the amount of resources it would require to keep an accurate record of everyone alive Forebears does not seek to make edits to the underlying data.

Names containing diacritical marks (or accents) will be considered different to those that don't. For example the name Öztürk (568,848 bearers) is treated differently to Ozturk (7,920 bearers).

Names may contain any character in the Latin Unicode range as well as apostrophe ('), hyphen (-) and space ( ). These are also considered differently. For example, Jones-Williams is considered different to Jones Williams.
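The matching rules above, combined with the case-insensitive recombination described in the compilation steps, can be sketched as a comparison key in Python. The function name name_key is hypothetical, and normalising to Unicode NFC is an assumption of this sketch rather than something the text states.

```python
import unicodedata

def name_key(name: str) -> str:
    """Key under which two spellings count as the same name.

    Case is ignored, while diacritics, hyphens, spaces and
    apostrophes all remain significant.
    """
    # NFC makes precomposed and combining forms of an accented letter equal.
    # lower() is used instead of casefold() so that "ß" is not turned into "ss".
    return unicodedata.normalize("NFC", name).lower()
```

Under this key ÖZTÜRK and Öztürk match, while Öztürk and Ozturk, or Jones-Williams and Jones Williams, remain distinct.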

Due to the copyright status of various sources used and Forebears' own privacy concerns, no details on living people will be disclosed under any circumstances. If you wish to locate people with a given name, it is recommended that you consult white pages or hire a private investigator.

The source data used to compile incidence does not always list a place, or region of residence. This generally relates to prisoners, military, police and foreign nationals living in a country; and to a lesser degree people whose specified place of residence could not be determined. For this reason a country may list, say ten people with a name, but only a total of nine in all the regions of the country.

As of the 4th of September 2018 the Forebears database contains 26,445,869 surnames, of which 234,267 are considered extinct.

The previous edition of the database released on the 15th of September 2014 included 11,303,059 surnames. Before that the database released on the 26th of April 2013 included 488,661 surnames. The initial database launched on the 20th of June 2012 included 424,349 surnames.

Forebears does not remove names from the website. Names cannot be owned and the factual information relating to the meaning or distribution of a surname cannot be subject to copyright.

Names - Lexical

Some data sources used to produce the name statistics contain both a Latin and non-Latin rendering of a name. In these cases the Latin rendering was used, as is.
Latin and non-Latin data was used for a number of countries.
In the case that a standardised transliteration method was used, non-Latin forms were consistently transliterated to the same Latin rendering.

Data / API

Data can be accessed via API, CSV/Excel upload and web interface at OnoGraph.

Forebears does not assist those who want to mine data from the website. Due to huge levels of data mining that peaked at over 75% of total requests to the website, the site now uses a hard firewall to block such requests and a soft one to return random data when unauthorised robot access is detected.

No; though a commercially available API is planned.