The International Plant Names Index

Standardizing the collation format

After cleaning, the data is put through a flexible process (a series of Python scripts) capable of carrying out one, all or any combination of the treatments below.

  • Converting roman numeral volume numbers
    The results of this automated update need careful checking, because scanning errors cause problems. For example, volume iv is frequently mis-scanned as lv and automatically translated into volume 55. The character ‘l’ may be intended as the letter l (the roman numeral for 50) or the number 1, or it may be a scanning error for ‘t.’ (tabula). We run this update on complete publication datasets, which then undergo checking, as blind conversions would undoubtedly multiply the errors and inconsistencies already present in the data.

  • Converting roman numeral part numbers and formatting correctly
    In ‘xiii. II. 381’, for example, the part number (II.) has to be distinguished from the volume number (xiii.) and the page number (381). Often the year of publication is present in the collation field too. The original punctuation has to be removed and the whole string reformatted as ‘13(2): 381’.

  • Moving date from collation field to year field
    Publication date is identified in the collation and removed to the publication year field, provided it is not acting as volume number.

  • Copying publication date from collation field to year field
    Where the publication year serves as the volume number, it is not moved out of the collation. It is copied to the publication year field only if the actual year of publication is the same.

  • Reordering the collation
    May be necessary if the collation is non-standard in its format.
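
The roman numeral treatments above can be sketched as follows. This is a minimal illustration, not the actual IPNI scripts: the function names (`roman_to_int`, `needs_review`, `standardize_collation`) are hypothetical, and the sketch assumes a collation of exactly the form ‘volume. part. page’, as in the ‘xiii. II. 381’ example. Anything that does not fit is returned as `None` so it can be checked by hand rather than converted blindly.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# Strict pattern for a well-formed roman numeral (1 to 3999), case-insensitive.
ROMAN_RE = re.compile(
    r"^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$", re.IGNORECASE
)

# Assumed layout: roman volume, roman part, arabic page, separated by '. '.
COLLATION_RE = re.compile(
    r"^\s*(?P<volume>[ivxlcdm]+)\.\s+(?P<part>[ivxlcdm]+)\.\s+(?P<page>\d+)\s*$",
    re.IGNORECASE,
)


def roman_to_int(numeral):
    """Convert a roman numeral to an integer; return None if it is malformed."""
    text = numeral.strip().rstrip(".").lower()
    if not text or not ROMAN_RE.match(text):
        return None
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if ROMAN_VALUES.get(nxt, 0) > value:
            total -= value
        else:
            total += value
    return total


def needs_review(numeral):
    """Flag numerals containing 'l', which may be a mis-scanned 'i', '1' or 't.'."""
    return "l" in numeral.lower()


def standardize_collation(collation):
    """Rewrite 'volume. part. page' as 'volume(part): page', or return None."""
    match = COLLATION_RE.match(collation)
    if match is None:
        return None
    volume = roman_to_int(match.group("volume"))
    part = roman_to_int(match.group("part"))
    if volume is None or part is None:
        return None
    return f"{volume}({part}): {match.group('page')}"


print(standardize_collation("xiii. II. 381"))
print(roman_to_int("lv"), needs_review("lv"))
```

A mis-scanned ‘lv’ is still a well-formed numeral, so the conversion alone cannot catch it. That is why the sketch flags every numeral containing ‘l’ for review instead of trying to guess the intended value.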

The collations of the legacy data in IPNI vary widely in their construction, and there is no such thing as a standard format. Each dataset we have dealt with has had peculiarities of its own (e.g. ‘Lit.’ in Linnaea, ‘Anhang’ in Repert. Spec. Nov. Regni Veg. Beih.). There are also many records that include remarks such as ‘in syn.’ in the collation field, or that contain a double reference citation separated by a semicolon. The latter all need to be checked carefully to determine which is the valid reference.
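
Records of this kind can at least be picked out automatically for manual checking. The sketch below is one possible way to do so and is not taken from the IPNI scripts: the function name `review_reasons` is hypothetical, ‘in syn.’ is the only remark marker it knows (the only one named above), and it assumes that any semicolon separates two reference citations.

```python
import re

# Only 'in syn.' comes from the text; a real list of remarks would be longer.
REMARK_RE = re.compile(r"\bin syn\.", re.IGNORECASE)


def review_reasons(collation):
    """Return the reasons a collation should be checked by hand (may be empty)."""
    reasons = []
    if REMARK_RE.search(collation):
        reasons.append("remark")
    # Assumption: a semicolon separates two reference citations.
    parts = [p.strip() for p in collation.split(";") if p.strip()]
    if len(parts) > 1:
        reasons.append("double reference")
    return reasons


for sample in ["13(2): 381", "12: 45, in syn.", "12: 45; 13: 7"]:
    print(sample, "->", review_reasons(sample))
```

The function only reports why a record needs attention. Deciding which of two citations is the valid reference remains a manual step.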



© Copyright 2004 International Plant Names Index
