Users can upload CSV and Excel files to be enriched with nationality and gender predictions.
Below you will find general use guidelines for Excel and CSV uploads.
Computers use a number of maps (technically known as character encodings) to link digitally stored information to characters, e.g. the Latin A, the Cyrillic Ш and the Armenian Ր. Popular encodings include UTF-8, ISO-8859-1 and Windows-1251.
The twenty-six letters of the English Latin alphabet are represented the same way in most encodings, but other characters, such as the Latin é, are typically encoded differently. For this reason, it is important to know which encoding your file uses.
As an example, reading names encoded in one character encoding (UTF-8) as if they were in another (Windows-1252) will lead to the names Núñez and Jovanović being rendered as NÃºÃ±ez and JovanoviÄ‡. OnoGraph cannot recognize such names when there is an encoding conflict.
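The mismatch described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name `misread_utf8_as_cp1252` is made up for this example and is not part of OnoGraph.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# Each accented letter here takes two bytes in UTF-8, and Windows-1252 maps
# each of those bytes to its own character, so one letter turns into two.
garbled = {name: misread_utf8_as_cp1252(name) for name in ("Núñez", "Jovanović")}
# garbled == {"Núñez": "NÃºÃ±ez", "Jovanović": "JovanoviÄ‡"}
```

The unaccented letters survive unchanged because both encodings store them as the same single byte.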
OnoGraph cannot detect with certainty which character encoding your file uses, although it does attempt to predict it.
We recommend you encode all your files in UTF-8, the encoding we use, to avoid problems with character encoding. Consult the following guides to convert your files to UTF-8 using Excel and OpenOffice.
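If you would rather convert a CSV file with a script than with a spreadsheet program, the following Python sketch shows one way to do it. It assumes you already know the file's current encoding; the function name `convert_to_utf8` and the file names are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(src_path: str, dst_path: str, source_encoding: str) -> None:
    """Re-encode a text file to UTF-8, about a million characters at a time."""
    # newline="" stops Python from translating line endings, so the CSV's
    # original row terminators are preserved.
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(1024 * 1024)
            if not chunk:
                break
            dst.write(chunk)


# Demo: write a small ISO-8859-1 file, convert it, and read back the bytes.
with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "names_latin1.csv")
    utf8_path = os.path.join(tmp, "names_utf8.csv")
    with open(legacy_path, "wb") as f:
        f.write("name\r\nNúñez\r\n".encode("iso-8859-1"))
    convert_to_utf8(legacy_path, utf8_path, "iso-8859-1")
    with open(utf8_path, "rb") as f:
        converted = f.read()
```

If the wrong `source_encoding` is given, the conversion may still succeed and write garbled names, so check the result before uploading.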
If your file is not UTF-8 encoded you should carefully check the input preview on the configuration page. This shows the first 100 rows of the file you uploaded. Pay attention to any characters that are displayed incorrectly. Use the character encoding drop-down to select different encodings until your data is displayed correctly.
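The same trial-and-error check can be sketched offline in Python. This is not OnoGraph's implementation; `preview_with_encodings` is a hypothetical helper that decodes the same bytes with several candidate encodings so the results can be compared by eye.

```python
from typing import Dict, Iterable, List, Optional


def preview_with_encodings(
    raw: bytes, encodings: Iterable[str], max_rows: int = 100
) -> Dict[str, Optional[List[str]]]:
    """Decode raw with each candidate; None marks an encoding that cannot decode it."""
    previews: Dict[str, Optional[List[str]]] = {}
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            previews[encoding] = None
        else:
            previews[encoding] = text.splitlines()[:max_rows]
    return previews


# A Windows-1251 (Cyrillic) file whose encoding is not known in advance.
raw = "Шишкин\nИванова\n".encode("cp1251")
previews = preview_with_encodings(raw, ["utf-8", "cp1252", "cp1251"])
# previews["utf-8"]  is None (the bytes are not valid UTF-8)
# previews["cp1252"] decodes without error but shows the wrong letters
# previews["cp1251"] == ["Шишкин", "Иванова"]
```

Because Windows-1252 also decodes these bytes without raising an error, a program cannot tell from the bytes alone which result is right. That is why the visual check of the preview matters.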
The following screenshot shows a preview of a file where the encoding was not detected correctly. Instead of accented Latin characters it shows Chinese characters or a box.
Select options from the Character Encoding drop-down to preview the sample with different encodings. The options are ordered by alphabet group. When you select the correct encoding the preview will be displayed with the correct characters, e.g.
If you cannot get your file to display with the correct characters, it is likely corrupt or uses an unsupported encoding. In the latter case, we advise you to convert it to UTF-8.
OnoGraph will attempt to detect the enclosures and separators of uploaded CSV files. The separator is the character that delimits columns. It is typically a comma (,), but semicolons (;) and tabs (\t) are also common. The enclosure is the character used to wrap a column's contents so that they can include the separator character itself, for example when the separator is a comma (,) and a column value contains a comma.
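A short Python sketch using the standard `csv` module shows why the enclosure matters. The double quote (") is assumed as the enclosure here because it is the `csv` module's default; the sample data is invented.

```python
import csv
import io

rows = [["name", "city"], ["Núñez, María", "Madrid"]]

# Write the rows with comma as the separator and the double quote as the enclosure.
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=",", quotechar='"').writerows(rows)
text = buffer.getvalue()
# text == 'name,city\r\n"Núñez, María",Madrid\r\n'

# A CSV-aware reader honours the enclosure and recovers two columns per row.
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"'))

# Splitting on the separator alone breaks the name into two columns.
naive = text.splitlines()[1].split(",")
```

Only the value that contains the separator is wrapped; `parsed` matches the original rows, while `naive` ends up with three pieces instead of two.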
The enclosure and separator of your file should be detected automatically. If they are not, the preview will show all or most of each row's data in a single column. For example:
In this case the separator is set as semi-colon (;), but you can see in the preview the actual separator is comma (,).
Typing a comma (,) into the Column Separator field corrects the preview, and OnoGraph will then be able to process the file.
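The effect of a wrong separator can be reproduced with Python's `csv` module. The sketch below is illustrative only: the sample data is invented, and `csv.Sniffer` stands in for automatic detection in general, not for whatever method OnoGraph uses.

```python
import csv
import io

sample = "first_name,last_name,country\nAna,Núñez,ES\nMarko,Jovanović,RS\n"


def parse(text: str, separator: str) -> list:
    """Parse CSV text into rows using the given column separator."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=separator))


wrong = parse(sample, ";")  # every row collapses into a single column
right = parse(sample, ",")  # three columns per row

# csv.Sniffer makes a best-effort guess from a sample; it can be wrong
# or raise csv.Error on ambiguous input.
guessed = csv.Sniffer().sniff(sample, delimiters=",;\t").delimiter
```

With the semicolon, each row comes back as one long string; with the comma, each row splits into three values.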
The upload facility accepts files in either Excel or CSV format and they may be compressed with either gzip or ZIP compression.
Uploads must be no larger than 2GB (gigabytes). For compressed files, the 2GB limit applies to the compressed file; there is no limit on the uncompressed size.
Compressed files should contain only one CSV or Excel file.
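As a rough sketch of preparing a compressed upload, the Python below gzips a single CSV file and checks the compressed size. The function name `gzip_for_upload` is hypothetical, and the byte value used for the 2GB limit is an assumption, since the exact figure is not stated.

```python
import gzip
import os
import shutil
import tempfile

# Assumption: "2GB" is read conservatively as 2 * 1000**3 bytes; the exact
# byte count OnoGraph enforces is not stated.
MAX_UPLOAD_BYTES = 2 * 1000**3


def gzip_for_upload(csv_path: str) -> str:
    """Compress one CSV file to <csv_path>.gz and check the compressed size."""
    gz_path = csv_path + ".gz"
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if os.path.getsize(gz_path) > MAX_UPLOAD_BYTES:
        raise ValueError("compressed file exceeds the assumed 2GB upload limit")
    return gz_path


# Demo: compress a small UTF-8 CSV and confirm it decompresses unchanged.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "names.csv")
    original = "name\r\nNúñez\r\nJovanović\r\n".encode("utf-8")
    with open(csv_path, "wb") as f:
        f.write(original)
    gz_path = gzip_for_upload(csv_path)
    with gzip.open(gz_path, "rb") as f:
        round_trip = f.read()
    gz_name = os.path.basename(gz_path)
```

A gzip file holds exactly one file by design, which fits the one-file-per-archive rule above.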