The International Plant Names Index

Cleaning the data

The first stage in the data improvement process was to fix the citations which had suffered parsing problems on import, leaving data inappropriately split across data fields.

These fixes were automated using Python scripts. Each category of error had its own particular requirements and each record was checked manually after splitting to ensure the integrity of the data.

Here are some examples of the types of parsing problems worked on:

Id    Publication Title Field    Collation Field    Approx. number of records similarly affected
406606-1 in Ann. Conserv. & Jard. Bot. Geneve, xvii 296 (1914) 24,000
54523-1 in Engl. & Prantl, Naturl. Pflanzenfam. iv. 3b (1895) 311
328460-1 Bot. Cab. t. 1743 11,000
122151-1 Bot. Reg. sub I. 1068
829401-1 in Meded. 'S Lands Plantent. 16: 278 2000
36603-1 in Sitzb. Math. -Phys. Akad. Muench. viil. (1878) 246 16,500
598078-1 in Bull. Phys.-Math. Akad. Petersb. x. (1852) 342 72,000
214525-1 in Bull. Soc. Bot. France, li. p. lxxvii. (1904)
834387-1 in Wall. Cat. p. 237, n. 108 800
148950-1 in Sargent. Pl. Wi 7son. i. 138 (1911) ?
550062-1 in Kew Bu 71. 1911, 182
490265-1 Fl. Bras., Fasc. 4, xxv. III. 21 (1941) 1900
665923-1 Cat. n. 87 (1871) 6300
752478-1 in Bull. Bishop Mus., Honolulu No. 174, 90 (1943) 8500
360582-1 Notes Mesembr. Pt II. 461 (1934) 2500
130014-1 Hort. Brit. ed. III. 283 25,300
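As an illustration of what one such automated fix might look like, here is a minimal sketch for a single category of error: a volume number in lower-case roman numerals left at the end of the title field. It is not the script actually used. The function name `resplit`, the regular expression, and the title/collation boundaries in the sample input are assumptions, since the listing above does not preserve where each field ended. The function returns a flag so that every changed record can be queued for the manual check.

```python
import re

# Assumption: a stray volume number is a trailing run of lower-case roman
# numeral letters, optionally followed by a full stop, separated from the
# rest of the title by a comma and/or whitespace.
TRAILING_VOLUME = re.compile(r"[,\s]+([ivxlc]+\.?)\s*$")


def resplit(title, collation):
    """Move a trailing roman-numeral volume from the title to the collation.

    Returns (title, collation, changed). Records with changed=True still
    need to be checked by hand, because an ordinary abbreviation made up
    only of these letters would also match.
    """
    title = title.strip()
    collation = collation.strip()
    match = TRAILING_VOLUME.search(title)
    if match is None:
        return title, collation, False
    new_title = title[:match.start()].rstrip(" ,")
    new_collation = f"{match.group(1)} {collation}".strip()
    return new_title, new_collation, True


# Hypothetical field boundaries for record 406606-1.
print(resplit("in Ann. Conserv. & Jard. Bot. Geneve, xvii", "296 (1914)"))
```

Other categories (plate numbers, catalogue numbers, parts and fascicles) would each need their own pattern, which is consistent with each category of error having its own requirements.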

Cleaning misparsed records that have an edition number in the title or the collation has proven particularly time-consuming. This is because in some cases ‘ed.’ is part of the standard form of the publication title (as in Sp. Pl., ed. 2), and in others it is part of the collation.

To be sure of capturing all of the possible candidates for this clean-up we had to extract all IPNI records which had ‘ed.’ in either the title field or the collation field (more than 25,000 records). We are in the process of checking and correcting these records on a case-by-case basis.
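The extraction step could be sketched as below. This is an assumption-laden illustration, not the query actually run against IPNI: the record layout (dictionaries with `id`, `title` and `collation` keys), the function name `edition_candidates`, and the field boundaries in the sample records are all hypothetical. The pattern matches ‘ed.’ only as a separate token, so abbreviations such as ‘Meded.’ are not picked up.

```python
import re

# 'ed.' as a token of its own: the word boundary stops it matching
# inside abbreviations such as 'Meded.'.
ED_TOKEN = re.compile(r"\bed\.")


def edition_candidates(records):
    """Return the records with 'ed.' in the title or the collation field."""
    return [
        record
        for record in records
        if ED_TOKEN.search(record.get("title") or "")
        or ED_TOKEN.search(record.get("collation") or "")
    ]


# Sample records taken from the examples above; the split between
# title and collation shown here is assumed for illustration.
records = [
    {"id": "130014-1", "title": "Hort. Brit. ed. III.", "collation": "283"},
    {"id": "829401-1", "title": "in Meded. 'S Lands Plantent.", "collation": "16: 278"},
    {"id": "328460-1", "title": "Bot. Cab.", "collation": "t. 1743"},
]

for record in edition_candidates(records):
    print(record["id"], record["title"], record["collation"])
```

A filter like this only collects candidates. Deciding whether ‘ed.’ belongs to the title or to the collation still has to be done record by record.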

The situation is similar for those publications involving series and supplements.

Once the records have been correctly parsed, the data is ready to undergo intensive processing of the collation.


© Copyright 2004 International Plant Names Index
