About Forebears Names

Forebears Names (accessible here and here) is a free service providing access to the largest geospatial database of forename and surname distribution and demographics. It provides the approximate incidence of forenames and surnames produced from a database of 4,044,546,938 people (55.5% of living people in 2014). As of September 2019 it covers 27,662,801 forenames and 27,206,821 surnames in 236 jurisdictions. The geospatial data can be viewed on an interactive map and in table form. Statistics can be viewed in a global, continental, georegional, national and (multi-)regional scope.

Brief History

Forebears Names was introduced to Forebears in June 2012, with the launch of the website. At that time the scope was limited to England, Scotland, Wales and The Channel Islands, covering around 425,000 surnames listed in the 1881 census of The United Kingdom. In April 2013 over 64,000 surnames were added from the Ireland census of 1901. This initial version was presented using HighMaps.

Initial mapping facility

From April 2013 Forebears began the long process of compiling data for the first global surname mapping facility. At this time the most expansive source was Public Profiler's Worldnames project, which covers twenty-six countries with a sample of 300 million people. Due to multiple data sources, differences in formatting and writing scripts, and a whole host of other problems, the facility was not launched until September 2014. The data was derived from a sample of 1,587,475,724 people, covering 227 sovereign states, dependencies and territories. An update occurred a few months later, fixing a number of issues and adding a small amount of new data.

Immediately work commenced on an expansive update, including more jurisdictions, more depth and a larger sample. The update was initially projected for February 2016, but tasks always took longer than anticipated and new tasks regularly presented themselves. As there was no standard format to the data sources being used, extracting them and arranging them in a universal manner took up to two months in the case of one country.

Once the data was compiled for the update it took a further six months to correctly assign individuals to identifiable administrative divisions. This was owing to the source data being from hundreds of individual sources that didn't use a universal way of denoting location.

A further two months were required to rebuild the geospatial statistics generation script and to build a new website and mapping interface. The second version of global surname mapping was released on the 5th of September 2018. This build was from a sample of 3,936,342,242 people, covering 26,211,602 surnames and 236 jurisdictions. This update moved the visual presentation of data to Leaflet and added first level administrative divisions for many countries, which were later bolstered with second level administrative divisions.

Current mapping facility

Work on another major update commenced in December 2018, focusing on the addition of forename data. It took longer than expected owing to the use of disparate data sources that took time to fashion into a single format. This is the current version of the project, produced from a global sample of 4,044,546,938 people and covering 27,662,801 forenames and 27,206,821 surnames in 236 jurisdictions. This was released in late August 2019 and included

In early 2019 Forebears began a process of adding demographic data for surnames, such as the distribution of religious faith and average income. This was added for forenames in August 2019.

Going forward the emphasis of the project is to increase the data sample, develop and expand an API that predicts demographic factors from a name input, add more demographic data and, to a lesser extent, add historic data.

Forebears Names data has been used by publicly traded companies, banks, national security contractors, marketers and The Federal Reserve, and has been cited in over 60 academic studies.

Process

The creation of geospatial data for names has three stages: the extraction of data from sources and conversion to a universal database format; the sanitisation of the resultant databases and the referencing of people to geographic regions; and the compilation of the geospatial data itself.

1) Database Creation

The first stage in the process of producing geospatial data is the importation of each data source (of which there are over 350) to an individual database table in a universal format. The basic format is: forename, middle name and surname. The source data has come in many formats. Some were easy to import, such as CSVs, Excel spreadsheets, database dumps and standalone databases, while others have been problematic, specifically PDFs, of which around 40-50 million pages have been parsed. The character encoding of each source is checked and, if need be, converted to UTF-8, which is the encoding used for all data.
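As an illustration of the encoding step, the following is a minimal Python sketch that assumes the source encoding has already been identified. The function name `to_utf8` and the example encoding are hypothetical and are not taken from the Forebears codebase.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes using the source's identified encoding and re-encode as UTF-8."""
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical example: a surname exported from a Windows-1252 source.
legacy_bytes = "Öztürk".encode("cp1252")
utf8_bytes = to_utf8(legacy_bytes, "cp1252")
print(utf8_bytes.decode("utf-8"))
```

A wrongly identified encoding will often decode without raising an error and silently produce corrupted diacritics, which is why the encoding of each source is checked before conversion.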

2) Sanitisation and Geospatial Referencing

Once a source has been imported to a database table, various facets of its data are sanitised and assessed for their integrity. Specifically the name is sanitised to remove any character other than Latin alphabetic characters, hyphens (-), spaces ( ) and apostrophes ('). Various changes are made to fix common errors, such as the name McDonald appearing as “Mc Donald”; names beginning with “Dr”, “Mr”, “Mrs” etc. and names beginning with hyphens.
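The sanitisation rules above could be sketched in Python as follows. This is an illustrative approximation only: the title list, the order of operations and the helper names `is_allowed` and `sanitise_name` are assumptions, and the actual rules used by Forebears are likely more extensive.

```python
import re
import unicodedata

# Assumption: the source lists "Dr", "Mr", "Mrs" etc.; "ms" is added here for illustration.
TITLES = {"dr", "mr", "mrs", "ms"}


def is_allowed(ch: str) -> bool:
    """Keep hyphens, spaces, apostrophes and Latin-script letters (with or without diacritics)."""
    if ch in "-' ":
        return True
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def sanitise_name(raw: str) -> str:
    text = unicodedata.normalize("NFC", raw)
    text = re.sub(r"\s+", " ", text)                      # tabs and newlines become spaces
    text = "".join(ch for ch in text if is_allowed(ch))   # drop digits, punctuation, other scripts
    text = re.sub(r" +", " ", text).lstrip(" -").rstrip()  # collapse spaces, drop leading hyphens
    words = text.split(" ")
    while len(words) > 1 and words[0].lower() in TITLES:  # strip leading titles
        words = words[1:]
    text = " ".join(words)
    return re.sub(r"\bMc (?=[A-Z])", "Mc", text)          # "Mc Donald" -> "McDonald"


print(sanitise_name("Dr. John Smith"))  # John Smith
print(sanitise_name("Mc Donald"))       # McDonald
print(sanitise_name("-Smith3"))         # Smith
```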

In some cases source data has only included a single name string and not a specifically defined forename and surname. In these cases the name parts are extracted, taking into account particles, such as “de la”, “bin” and “van”.
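A simplified Python sketch of particle-aware splitting is shown below. It assumes the string is ordered given names first and surname last, and uses only the particles named above; the set `PARTICLE_WORDS`, the function `split_full_name` and the handling of single-word names are hypothetical.

```python
# Assumption: only the particles mentioned in the text; a real list would be far longer.
PARTICLE_WORDS = {"de", "la", "bin", "van"}


def split_full_name(full_name: str) -> tuple[str, str]:
    """Return (given_names, surname), attaching leading particles to the surname."""
    words = full_name.split()
    if len(words) < 2:
        # Assumption: a single-word name is returned as a given name with no surname.
        return full_name.strip(), ""
    start = len(words) - 1
    # Walk backwards over particles, but always leave at least one given name.
    while start > 1 and words[start - 1].lower() in PARTICLE_WORDS:
        start -= 1
    return " ".join(words[:start]), " ".join(words[start:])


print(split_full_name("Maria de la Cruz"))  # ('Maria', 'de la Cruz')
print(split_full_name("Ahmad bin Ali"))     # ('Ahmad', 'bin Ali')
```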

Forenames are assessed to ascertain if they contain more than one name and any extra components are assigned to a middle name. They may only contain one name, including particles in names like “Abd Rahaman” and “La Toya”. One current exception is when the forename was derived from a writing script other than Latin and it was determined the forename should have a space. This will be changed in a future update.

Forenames are always derived from the first part of a given name. In some cases the forename is an initial, in which case the initial is assigned to the middle name and the forename is blank.
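The forename and middle name rules described above might look like this in Python. This is a sketch under stated assumptions: the particle set is drawn only from the two examples given, and `split_given_name` is a hypothetical name.

```python
# Assumption: particles taken from the examples "Abd Rahaman" and "La Toya" only.
FORENAME_PARTICLES = {"abd", "la"}


def split_given_name(given: str) -> tuple[str, str]:
    """Return (forename, middle_name) from a given-name string."""
    words = given.split()
    if not words:
        return "", ""
    # A leading particle stays attached to the word that follows it.
    take = 2 if len(words) > 1 and words[0].lower() in FORENAME_PARTICLES else 1
    forename = " ".join(words[:take])
    middle = " ".join(words[take:])
    if len(forename.rstrip(".")) == 1:
        # The forename is an initial: move it to the middle name and leave the forename blank.
        middle = f"{forename} {middle}".strip()
        forename = ""
    return forename, middle


print(split_given_name("Abd Rahaman Ali"))  # ('Abd Rahaman', 'Ali')
print(split_given_name("J. Edgar"))         # ('', 'J. Edgar')
```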

Surnames from the Spanish and Portuguese traditions, where individuals usually have a surname from their mother and father, are stored separately so far as Forebears has been able to discern from the source data.

Multiple sources were obtained with corrupted diacritic marks (or accents). In these cases the data was recovered with reference to other sources for the country in question.

Some sources encoded names in ASCII, omitting diacritic marks that should be present. In such cases the diacritics have been restored.

A limited number of sources had a significant minority of names back-to-front (surname as forename and vice versa). In these cases Forebears has correctly arranged the names as much as possible. Reversed names also occur occasionally in any database as a human data entry error.

Some sources were in writing scripts other than Latin. This presents an issue in that it is not known how each individual may convert their name into the Latin alphabet. This process of conversion is known as transliteration or Romanisation. Further the majority will not have a Latin rendering of their name. The solution Forebears has used, as much as possible, is to use the most prevalent trends in transliteration to systematically convert all names in a given writing script to Latin.

Forebears uses the following methods for transliteration:

  • Arabic (Hassaniya): conversion tables, Government of Mauritania
  • Arabic (standard): Forebears proprietary
  • Armenian: ICU modified
  • Azerbaijani: 'ə$' => 'e', 'Ə' => 'A', 'ə' => 'a'
  • Bengali: conversion tables, Government of West Bengal
  • Bulgarian: Forebears proprietary
  • Burmese: Forebears proprietary
  • Chinese: ICU
  • Dhivehi: conversion tables, Government of The Maldives
  • Farsi: Forebears proprietary
  • Georgian: conversion tables, Government of Georgia
  • Greek: Forebears proprietary
  • Gujarati: conversion tables, Government of Gujarat
  • Hebrew: Forebears proprietary
  • Hindi: conversion tables, Governments of Uttar Pradesh and Rajasthan
  • Japanese: jTalk
  • Kannada: ICU modified
  • Khmer: Forebears proprietary
  • Korean: ICU
  • Macedonian: Forebears proprietary
  • Marathi: conversion tables, Government of Maharashtra
  • Mongolian: Forebears proprietary
  • Nepali: ICU modified
  • Oriya: conversion tables, Government of Odisha
  • Russian: Forebears proprietary
  • Serbian: Forebears proprietary
  • Thai: RTGS
  • Tibetan: conversion tables, Government of Bhutan
  • Ukrainian: Forebears proprietary
  • Urdu: Forebears proprietary
  • Uzbek: Forebears proprietary
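Of the methods listed, only the Azerbaijani rules are spelled out in full. A minimal Python sketch follows, assuming the three rules are regular-expression substitutions applied in the listed order to the whole name string (so that 'ə$' matches only a final ə). The names `AZ_RULES` and `transliterate_azerbaijani` are hypothetical.

```python
import re

# Rules as listed above; applying them in this order is an assumption.
AZ_RULES = [
    (r"ə$", "e"),  # a final ə becomes e
    (r"Ə", "A"),   # capital Ə becomes A
    (r"ə", "a"),   # any remaining ə becomes a
]


def transliterate_azerbaijani(name: str) -> str:
    for pattern, replacement in AZ_RULES:
        name = re.sub(pattern, replacement, name)
    return name


print(transliterate_azerbaijani("Əliyev"))   # Aliyev
print(transliterate_azerbaijani("Həsənov"))  # Hasanov
print(transliterate_azerbaijani("Xədicə"))   # Xadice
```

The order matters: if the general 'ə' rule ran first, no final ə would remain for the 'ə$' rule to match.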

The gender of individuals is sanitised to include only male, female and, in a minority of cases, other. X and Y are used to denote the gender of individuals who appear with no forename or whose forename is an initial. This is to maintain the sex ratio of usable forenames for producing statistics.

Dates of birth are checked to be valid and within a reasonable time period (e.g. not born in 1500).

These are the basic functions that are regularly performed on data. Many sources required specific attention, such as extracting names from elaborate strings that include patronymic and matrimonial references. In Hungary, for example, many women appeared with their husband's forename and a suffix denoting “wife of”.

In countries where many people do not have surnames (Indonesia, Myanmar), the part of the name from which a surname would be created in a Western context has been treated as the surname.

As of September 2019, 145 jurisdictions appear with at least one level of administrative division. For example, within The United States name distribution statistics can be viewed at a state and county/independent city level. Assigning individuals to administrative divisions was often simple, as many sources delineated individuals as such. In other cases the division had to be inferred from postal code and/or city, which was not always a simple task due to changes in postal codes, administrative boundaries and a variety of other issues. Administrative divisions are assigned from GeoPostcodes' global postal code database.

Once administrative divisions are assigned, the resultant taxonomy is verified against GeoPostcodes' data and other sources to ensure there are no omissions, duplications or erroneous additions.

The incidence of a name in a jurisdiction's administrative division may be lower than in the jurisdiction owing to some individuals not being assigned to a division.

A small minority of administrative divisions are missing from the source data and appear as empty.

The percentage of each administrative division's population that is represented can vary.

Forebears have assigned administrative divisions based on those used at the time individuals were referenced to administrative divisions. Administrative divisions will not be updated to account for future changes.

A small number of individuals are not assigned to a place owing to insufficient or ambiguous geographic references. A number of people who could be assigned to administrative divisions, but only by being individually catalogued, have not been assigned to divisions, as this would be an extremely inefficient use of resources that would have considerably delayed the project.

3) Compilation

This process is followed for each jurisdiction.

1) Firstly individuals are grouped by diacritic-sensitive name and by their lowest level administrative division (so in The United States, counties), or by no division if they are not assigned to one. When compiling forenames, gender is added to the grouping. Only the lowest administrative division for each individual is used because some jurisdictions do not have a universal structure for delineating divisions.

1a) Due to an unequal gender ratio in the data for Macedonia, Tajikistan, Turkmenistan and Uzbekistan, surnames are adjusted to the sex ratio of the country.

1b) A small number of jurisdictions with a small sample have the names of immigrants moved to another table, so their incidence is not scaled up to find the approximate number of people with that name.

1c) Due to an over-representation of guest workers in the following countries: Bahrain, Hong Kong, Kuwait, Macau, Oman, Qatar, Saudi Arabia, Singapore and Taiwan, the incidence of names is modified to be in line with the representation of various ethnicities.

1d) Western forenames of Chinese people in China, Hong Kong, Macau, Singapore and Taiwan are ignored, e.g. Toby Ng.

2) The built data is re-combined to be case-insensitive.

3) When building forenames empty forenames are removed prior to building statistics.

4) When building forename statistics each administrative division (or the entire jurisdiction if no division) is assessed against the sex ratio of the country and the incidence of forenames is adjusted to bring it in line with the jurisdiction's sex ratio if need be.

5) When building forename statistics names are merged to combine incidence for the same name with different genders, e.g. males and females with the name Alex.

6) The population of each administrative division within the current jurisdiction is read from a database table. This is used to find the multiplier by which the sample for each administrative division needs to be adjusted. Individuals within divisions are adjusted, while any not within a division are left as is.

7) Any names ignored in step 1b) are reintroduced.

8) When building surname statistics any blank surnames are deleted. They are deleted at this stage as there are a number of countries where many people have no surname, such as India.

9) If the current jurisdiction has administrative divisions the built statistics are now used to create the higher level administrative divisions (including the jurisdiction) from the lowest division.

10) With statistics built, the percentage share of all names and the rank of each name are calculated for each administrative division and the jurisdiction. Standard competition ranking (1, 2, 2, 4) is used to produce the rankings. The name that occurs most in the area is ranked first, and names are then ranked in descending order of incidence with an increment of one. When two or more names occur the same number of times, they share the same rank, and the next rank is one more than the total number of names preceding it.

Surname Incidence Rank
Wang 100 1
Li 90 2
Chong 90 2
Chen 80 4
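The ranking scheme described in step 10 and shown in the table above, in which tied names share a rank and the following rank skips ahead, can be sketched in Python as follows. The function name `rank_names` is hypothetical and this is not Forebears' own implementation.

```python
def rank_names(incidence: dict[str, int]) -> dict[str, int]:
    """Rank names by descending incidence; ties share a rank and the next rank skips ahead."""
    ordered = sorted(incidence.items(), key=lambda item: -item[1])
    ranks: dict[str, int] = {}
    previous_count = None
    previous_rank = 0
    for position, (name, count) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the 1-based position.
        rank = previous_rank if count == previous_count else position
        ranks[name] = rank
        previous_count, previous_rank = count, rank
    return ranks


print(rank_names({"Wang": 100, "Li": 90, "Chong": 90, "Chen": 80}))
# {'Wang': 1, 'Li': 2, 'Chong': 2, 'Chen': 4}
```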

With each jurisdiction built, the results are compiled to produce the incidence, percentage of all names and rank at a global, continental, georegional (e.g. Western Europe) and onoregional* level. Finally each name is assessed to determine the country in which it has the highest incidence and the country in which it is most numerous compared to other names.

*Onoregions are regions delineated by Forebears denoting areas within georegions that share similar naming traditions.
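Steps 6 and 9 of the compilation process (scaling each division's sample to its population, then building the jurisdiction total from its divisions) can be illustrated with a short Python sketch. It assumes the multiplier is simply the division's population divided by its sample size and that scaled counts are rounded; neither detail is stated in the text, and the function names are hypothetical.

```python
from collections import Counter


def scale_division(sample_counts: dict[str, int], population: int) -> dict[str, int]:
    """Scale sampled name counts so that they sum to roughly the division's population."""
    sample_size = sum(sample_counts.values())
    if sample_size == 0:
        return {}
    multiplier = population / sample_size  # assumption: a simple population/sample ratio
    return {name: round(count * multiplier) for name, count in sample_counts.items()}


def build_jurisdiction(divisions: dict[str, dict[str, int]],
                       populations: dict[str, int],
                       unassigned: dict[str, int]) -> dict[str, int]:
    """Sum the scaled divisions; people without a division are added unscaled."""
    total = Counter(unassigned)
    for division, counts in divisions.items():
        total.update(scale_division(counts, populations[division]))
    return dict(total)


divisions = {"A": {"Smith": 30, "Jones": 20}, "B": {"Smith": 5}}
populations = {"A": 500, "B": 50}
print(build_jurisdiction(divisions, populations, {"Smith": 1}))
# {'Smith': 351, 'Jones': 200}
```

Because unassigned individuals are left unscaled, the jurisdiction total for a name can exceed the sum of its divisions, which matches the behaviour described for people with no recorded place of residence.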

Limitations

1) The primary limitation of Forebears Names is the inability to obtain data on all living individuals. Approximations derived from a small percentage of a population miss many names and can produce moderate inaccuracies in rankings in balanced samples, and much larger inaccuracies in imbalanced samples. Forebears seeks to address this by continually looking for new source data. However, Forebears is the largest geospatial names database, produced from a sample eight times larger than the nearest comparable service. As such Forebears provides the most comprehensive data for most jurisdictions, and the only data in many cases. There is also very little publicly available data on the distribution of forenames. To Forebears' knowledge there are three country-level services built from a larger sample than Forebears, which are listed in Appendix III.

2) Beyond censuses, which are typically conducted every ten years and not made publicly available, the currency of sources varies. The most commonly used source is voter lists, which cover most or a large portion of a country's adult population. However these include deceased people, sometimes in small but notable quantities. They also may not be updated after someone moves.

3) Some sources may be biased towards certain ethnic groups or those with higher incomes, neither of which is distributed equally by name.

4) Source data may contain human data-input errors and some data is self-reported, which may include names like “fghfghfghf” or “Jones Brothers Ltd”. Where these have been identified they have been removed.

5) Due to human data-input errors names occasionally occur back to front, e.g. Smith as a forename and John as a surname. In some databases from developing countries this was more common and in those cases Forebears attempted to rectify the issue as much as is possible.

6) Some sources are biased towards certain age groups, either the young (5-18), adults (18+) or those more economically active (25/30-60/70). This will most notably cause inaccuracies in forename distribution, as trends in naming babies can move dramatically over a generation. It will also affect the accuracy of surname distribution where immigration is a factor.

Appendix I: Sample Sizes

Below is a table showing the percentage of each jurisdiction's population that appears in Forebears Names' source data.

Country Sample Size (%)
Georgia 100
Spain 100
Israel 100
Armenia 100
Czech Republic 100
United States 100
Pitcairn Islands 100
Ukraine 99.6281
Taiwan 99.6114
Abkhazia 98.9871
Sweden 97.2559
Bulgaria 96.517
Kosovo 96.4718
China 94.6195
Saint Lucia 94.2909
Norway 93.5677
South Korea 92.9869
Trinidad and Tobago 91.5853
Slovenia 89.6028
Finland 88.354
Chile 88.0213
Indonesia 87.6476
Anguilla 86.1433
Saint Vincent and The Grenadines 85.9841
Poland 83.9834
Marshall Islands 79.4265
Lesotho 78.8843
Philippines 76.5383
Cook Islands 73.0529
Peru 72.0211
Costa Rica 71.8452
Monaco 71.4844
Scotland 70.9169
Croatia 69.1766
United States Virgin Islands 68.9192
Mexico 68.8096
Turkey 66.8737
Grenada 66.7275
Norfolk Island 66.4639
Maldives 65.8779
Guyana 65.5974
Nauru 65.2068
Lebanon 65.2032
Saint Kitts and Nevis 65.1228
Argentina 64.9276
Venezuela 64.7
Iceland 63.928
Puerto Rico 63.6864
Saint Pierre and Miquelon 62.4733
El Salvador 62.4351
India 62.3176
Nicaragua 61.8202
Jersey 60.9373
Panama 60.911
Slovakia 60.4236
British Virgin Islands 58.8361
England 57.8076
Azerbaijan 57.2106
Canada 56.9876
Denmark 56.8946
Cayman Islands 56.6993
Wales 55.6053
Papua New Guinea 54.6311
Australia 54.0804
Honduras 52.9872
Cape Verde 52.9174
Sao Tome and Principe 51.9546
Uruguay 51.3979
Cambodia 51.1247
Kyrgyzstan 50.9016
Bhutan 50.4298
Russia 49.0728
Liechtenstein 48.1295
Montserrat 48.0699
Brazil 47.4398
Saint Helena Ascension and Tristan Da Cunha 47.1741
Switzerland 46.5344
Bermuda 45.4084
Macedonia 45.1393
Belarus 44.9081
Belize 44.7883
Montenegro 44.6058
Netherlands 44.5946
Moldova 44.5832
Pakistan 44.4554
San Marino 44.3631
Benin 44.3528
Nepal 44.0495
Isle of Man 43.9562
Paraguay 43.45
New Zealand 42.6687
Vietnam 42.1542
American Samoa 42.0328
South Ossetia 41.1361
South Africa 40.8694
Palestine 40.6801
Turks and Caicos Islands 40.4197
Niger 39.9727
Belgium 39.9002
Northern Ireland 39.816
Colombia 39.6764
Niue 39.4296
Senegal 38.9956
Liberia 38.6844
Nigeria 38.4484
Uganda 37.9698
Gibraltar 37.6763
Northern Mariana Islands 37.0039
Germany 36.8024
Antigua and Barbuda 36.608
Latvia 36.2917
Barbados 35.2443
Mauritania 32.4168
Austria 32.3281
Ireland 32.3002
Jordan 32.077
Solomon Islands 32.0136
Ecuador 31.5766
Zimbabwe 31.4489
Andorra 31.3692
Luxembourg 31.2874
Greenland 30.8668
Iran 30.8147
Jamaica 30.6644
Transnistria 30.6377
Ivory Coast 30.1093
France 29.7655
Falkland Islands 29.2581
Cyprus 28.9219
Estonia 28.3213
Hungary 27.4044
Guam 26.9932
United Arab Emirates 26.7623
Bosnia and Herzegovina 26.6208
Serbia 26.3722
Aruba 25.9586
Burkina Faso 25.8066
Cameroon 25.7179
New Caledonia 24.4795
Greece 23.8838
Yemen 23.4968
Italy 23.2548
Bahamas 22.074
Kazakhstan 21.8617
French Polynesia 20.754
Singapore 20.0904
Dominica 19.9777
Guernsey 19.904
Botswana 19.5054
Iraq 19.2098
Malta 18.9377
Oman 18.8746
Lithuania 18.8726
Zambia 18.6179
Mauritius 17.981
Saint Barthelemy 17.5029
Suriname 16.9967
Mongolia 16.4875
Namibia 15.7669
Romania 15.3673
DR Congo 15.2121
Malaysia 15.0674
Algeria 14.2567
Dominican Republic 13.9566
Japan 13.8864
Brunei 13.6021
Faroe Islands 12.8317
Seychelles 12.1355
Micronesia 11.4133
Portugal 11.1799
Kuwait 11.1786
Thailand 11.0061
Qatar 10.0262
Kenya 9.7568
Wallis and Futuna 9.5775
Albania 9.2685
Hong Kong 8.439
Haiti 8.2689
Tuvalu 7.9927
Palau 7.6766
Bahrain 7.6359
Guatemala 7.4689
Tanzania 7.0847
Saint Martin 6.9815
Cuba 6.0307
Bolivia 5.8926
Kiribati 5.555
Vanuatu 5.4279
Tonga 5.0643
Tunisia 5.0363
Syria 4.7243
Fiji 4.7199
Djibouti 4.5811
Swaziland 4.393
Malawi 4.2594
Samoa 4.0158
Macau 3.9076
Morocco 3.5919
Gabon 3.4761
Uzbekistan 2.8366
Saudi Arabia 2.7967
Somalia 2.2936
Afghanistan 2.2096
Tajikistan 2.0175
Ghana 1.7606
Northern Cyprus 1.6708
Sri Lanka 1.6422
Mali 1.6245
Turkmenistan 1.5976
Togo 1.3964
Bangladesh 1.1586
Gambia 0.9721
Egypt 0.9354
Equatorial Guinea 0.8192
Libya 0.763
Ethiopia 0.7492
Angola 0.7433
Rwanda 0.5641
Sudan 0.5336
Comoros 0.4863
East Timor 0.4733
Madagascar 0.4584
Congo 0.4549
Myanmar 0.4233
Guinea 0.3532
South Sudan 0.3144
Mozambique 0.2871
Sierra Leone 0.2744
Burundi 0.2134
Laos 0.2033
Guinea Bissau 0.1708
Chad 0.1692
Central African Republic 0.1564
Eritrea 0.1473
North Korea 0.0095

Appendix II: Sources

Owing to its proprietary nature, Forebears does not cite sources used unless required to by law or the data is historic.

Partial list of sources:

  • Hagstova Føroya. (2015). Boys names 2001-2014. Retrieved from URL
  • Hagstova Føroya. (2015). Female names 2001-2014. Retrieved from URL
  • Hagstova Føroya. (2015). Surnames 2001-2014. Retrieved from URL
  • Instituto Nacional de Estadistica. (2018). Frecuencias de apellidos. Retrieved from URL
  • Instituto Nacional de Estadistica. (2018). Frecuencias de nombres. Retrieved from URL
  • Ministerstvo Vnitra České Republiky. (2016). Četnost jmen a příjmení. Retrieved from URL
  • Statistični urad Republike Slovenije. (2018). Imena dečkov, Slovenija, letno.
  • Statistični urad Republike Slovenije. (2018). Imena deklic, Slovenija, letno.
  • Statistični urad Republike Slovenije. (2018). Priimki, Slovenija, letno.
  • The Church of Jesus Christ of Latter Day Saints. (2001). 1880 United States Census.
  • The Church of Jesus Christ of Latter Day Saints. (2001). 1881 British Census.
  • The National Archives of Ireland. (2007). Ireland 1901 Census.
  • 전자가족관계등록시스템. (2019). 아기 이름 빈도.
  • 통계청. (2015). 성, 가족 기원 및 종교 관련 항목 조사

Appendix III: Services With a Larger Sample

The following services provide surname distribution statistics with a larger sample than Forebears.

Belgium: Familienaam provides distribution data from Belgium's 1998 and 2008 population registers.

France: Filae.com provides distribution data based on birth data from l'Institut National de la Statistique et des Études Économiques. This data is based on births and not where people were living in a given year.

Netherlands: The Meertens Instituut provides distribution data on all people with Dutch nationality who lived in The Netherlands in 2007.

Names - Distribution

Forebears provides data on name distribution and is not involved in the delineation of nations. The website does not specifically define any of the following disputed territories as nations or part of nations:

Abkhazia, Artsakh, Ceuta and Melilla, Crimea, Golan Heights, Israel, Kashmir, Kosovo, Northern Cyprus, Palestine, Sahrawi Republic, Somaliland, South Ossetia, Taiwan and Transnistria.

They all appear as they are controlled de facto. This decision was taken as the records Forebears uses are obtained from sources within the de facto jurisdiction; the de jure or other claimants have few or no records of the names of those who live in them. The administrative divisions within disputed territories also often differ from the de jure administrative divisions. This is current as of April 2014.

The delineations have not been made for political reasons and Forebears does not comment on boundary disputes.

The Donetsk People's Republic and The Luhansk People's Republic are not included as they were not established at the time of creation, and the situation in this area is unclear. The Autonomous Administration of North and East Syria is not included as no data has been obtained covering its boundaries. It appears as part of Syria.

Forebears uses a database of around four billion people to produce statistics relating to forenames and surnames. Thus they are approximations.

Due to the amount of resources it would require to keep an accurate record of everyone alive Forebears does not seek to make edits to the underlying data.

Names containing diacritical marks (or accents) will be considered different to those that don't. For example the name Öztürk (568,848 bearers) is treated differently to Ozturk (7,920 bearers).

Names may contain any character in the Latin Unicode range as well as apostrophe ('), hyphen (-) and space ( ). These are also considered differently. For example, Jones-Williams is considered different to Jones Williams.
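The matching behaviour described above, combined with the case-insensitive recombination mentioned in the compilation steps, can be illustrated with a small Python sketch. The key function below is a hypothetical illustration, not Forebears' own code, and it ignores locale-specific casing rules.

```python
import unicodedata


def name_key(name: str) -> str:
    """Build a grouping key that ignores case but preserves diacritics, hyphens and spaces."""
    # NFC ensures composed and decomposed forms of the same accented letter compare equal.
    return unicodedata.normalize("NFC", name).lower()


print(name_key("Öztürk") == name_key("ÖZTÜRK"))                  # True
print(name_key("Öztürk") == name_key("Ozturk"))                  # False
print(name_key("Jones-Williams") == name_key("Jones Williams"))  # False
```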

Due to the copyright status of various sources used and Forebears' own privacy concerns, no details on living people will be released under any circumstances. If you wish to locate people with a given name, it is recommended that you consult white pages or hire a private investigator.

The source data used to compile incidence does not always list a place, or region of residence. This generally relates to prisoners, military, police and foreign nationals living in a country; and to a lesser degree people whose specified place of residence could not be determined. For this reason a country may list, say ten people with a name, but only a total of nine in all the regions of the country.

As of the 4th of September 2018 the Forebears database contains 26,445,869 surnames, of which 234,267 are considered extinct.

The previous edition of the database released on the 15th of September 2014 included 11,303,059 surnames. Before that the database released on the 26th of April 2013 included 488,661 surnames. The initial database launched on the 20th of June 2012 included 424,349 surnames.

Forebears does not remove names from the website. Names cannot be owned and the factual information relating to the meaning or distribution of a surname cannot be subject to copyright.

Names - Lexical

  1. Some data sources used to produce the name statistics contain both a Latin and non-Latin rendering of a name. In these cases the Latin rendering was used, as is.
  2. Latin and non-Latin data was used for a number of countries.
  3. In the case that a standardised transliteration method was used, non-Latin forms were consistently transliterated to the same Latin rendering.

Data / API

Yes. Data can be accessed via API, CSV/Excel upload and web interface at OnoGraph.

Forebears does not assist those who want to mine data from the website. Due to huge levels of data mining that peaked at over 75% of total requests to the website, the site now uses a hard firewall to block such requests and a soft one to return random data when unauthorised robot access is detected.

No; though a commercially available API is planned.