What are the best practices for creating metadata?

The objective of this best practices guide is to help IAI investigators during the metadata creation process, in order to improve the usability and visibility of their data.

1. Assign descriptive file names to each metadata file created

Metadata file names should reflect the contents of the data and include enough information to uniquely identify the file. As a suggestion, file names should contain information such as program acronym, project number, investigator name or number, type of data, or any other pertinent information.

Clear, descriptive and unique file names become important later, when a metadata file is combined with other files in a directory or on an FTP site. Avoid file names such as mypersonaldata1.xml or data1.xml. An example of a good file name is IAI-ScientificArticle_Prog_001_Studies_999.xml, where IAI is the name of the institute, ScientificArticle indicates a publication, Prog is the program acronym, 001 is the project number, Studies is a word or expression that represents the data, 999 is a reference number for that specific publication, and .xml is the format of the metadata file.

When naming a metadata file, avoid special characters that may be handled differently on different computer platforms. You may want to apply the same logic when naming directories and data files, or when entering the title of a metadata record.
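
The naming pattern of the example file name can be sketched in Python as follows. This is only an illustration: the function name, its parameters and the set of characters treated as safe (ASCII letters, digits and hyphens) are assumptions, not rules defined by this guide.

```python
import re

# Assumption: ASCII letters, digits and hyphens are portable across platforms.
SAFE_PART = re.compile(r"[A-Za-z0-9-]+")


def build_metadata_filename(institute, data_type, program, project,
                            keyword, reference, ext="xml"):
    """Join the components in the order used by the example file name."""
    parts = [institute, data_type, program, project, keyword, reference]
    for part in parts:
        # Reject empty components and components with special characters.
        if not SAFE_PART.fullmatch(part):
            raise ValueError(f"unsafe or empty file name component: {part!r}")
    # Pattern: <institute>-<type>_<program>_<project>_<keyword>_<reference>.<ext>
    return f"{institute}-{data_type}_{program}_{project}_{keyword}_{reference}.{ext}"


print(build_metadata_filename("IAI", "ScientificArticle", "Prog", "001", "Studies", "999"))
# prints IAI-ScientificArticle_Prog_001_Studies_999.xml
```

Keeping the project and reference numbers as strings preserves their leading zeros.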

2. Use consistent and stable file formats when submitting data

ASCII format is the best way to ensure that data remain readable in the future and to avoid portability problems across platforms; however, other formats are also acceptable.

Recommended formats are listed below for each of the following three data types: Textual or Graphical data; Numerical, Coded or Raw data; and Binary or any other data structure.

Textual or Graphical data

Typical examples of textual data are publications (theses, papers or abstracts), which are largely self-documenting. The following formats work well for this kind of document: PDF, HTML or RTF. Examples of graphical data are posters, images, photos or presentations, for which GIF, JPG, TIFF or EPS/PDF are good options.

Numerical, Coded or Raw data

Data are distributed in a consistent form, one record per line, with individual values separated by commas or another delimiter character. A CSV (Comma Separated Values) file is a good example of this type of data. When an item is a text string, it is important to enclose it in quote characters, so that a delimiter inside the text is not mistaken for a field separator.

For example:
Program, Project Number, Title, PI, Year of Conclusion
PROG, 001, The use of soils in Amazon Region, Jose Maria Braga, 2004
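
One possible way to apply the quoting advice when producing such a file is Python's standard csv module. The sketch below reuses the record from the example; writing to an in-memory buffer instead of a file is just a convenience for the illustration.

```python
import csv
import io

header = ["Program", "Project Number", "Title", "PI", "Year of Conclusion"]
# The project number is kept as a string so its leading zeros survive.
record = ["PROG", "001", "The use of soils in Amazon Region", "Jose Maria Braga", 2004]

buffer = io.StringIO()
# QUOTE_NONNUMERIC encloses every non-numeric field in double quotes.
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerow(header)
writer.writerow(record)

csv_text = buffer.getvalue()
print(csv_text)
# "Program","Project Number","Title","PI","Year of Conclusion"
# "PROG","001","The use of soils in Amazon Region","Jose Maria Braga",2004
```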

Binary data or any other data structure

This category covers data submitted in a proprietary binary format that can be read only by the software that produced it. All information about the binary format should be well documented in the metadata, so that future users will be able to open the data in order to visualize and analyse them. Although this is a valid choice, it is better to avoid it and use a more universal format, such as PDF, whenever the situation allows.

3. Document data by using specific metadata fields

For others to find a data set through the DIS Search Process, the metadata should have enough fields filled in to document it well, so that users can easily identify the data most appropriate for their research. The DIS has four groups of fields: 1) Identification Information, where items such as type of data, title, abstract and scientific keywords are registered; 2) Data Quality Information, where information about completeness is entered; 3) Distribution Information, which documents online linkages (file formats and addresses); and 4) Metadata Reference Information, which documents who created the metadata, access information, etc.

Below are suggestions for some of these fields:

Dates: yyyymmdd, e.g., January 2, 1997 is 19970102.

Spatial Coordinates: Spatial coordinates should be recorded in decimal degrees with at least 5 digits after the decimal point. Provide latitude and longitude with south latitude and west longitude recorded as negative values (e.g., -50.50000 for 50° 30′ 00″ W).

When "Show pick list" appears beside a field, select it to see the available options and choose one of them, which keeps the search process simple and standardized. Although the list is not exclusive, it is suggested that only the listed options be used.

"Add another" after a field means that another group of information can be inserted (e.g., for Parameter Description, researchers might want to insert more than one keyword to make the metadata more complete).
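
The date and coordinate conventions in the suggestions above can be sketched in Python as follows. The function names are illustrative assumptions; the DIS itself may validate these fields differently.

```python
from datetime import date, datetime


def to_yyyymmdd(day):
    """Format a date as yyyymmdd."""
    return day.strftime("%Y%m%d")


def from_yyyymmdd(text):
    """Parse an 8-digit yyyymmdd string into a date."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected 8 digits (yyyymmdd), got {text!r}")
    return datetime.strptime(text, "%Y%m%d").date()


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees (5 decimals)."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are recorded as negative values.
    if hemisphere in ("S", "W"):
        value = -value
    return f"{value:.5f}"


print(to_yyyymmdd(date(1997, 1, 2)))     # prints 19970102
print(dms_to_decimal(50, 30, 0, "W"))    # prints -50.50000
```

Thirty minutes are half a degree, which is why 50° 30′ 00″ W becomes -50.50000.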

4. Use consistent data organization

When disseminating data in a raw format, it is suggested that the information be organized in a single file. Whichever layout you use, be sure to place the names of the fields/parameters on the first line of the file and then insert the actual records on the subsequent lines. In practice, each row in a file usually represents a complete record, and the columns represent the fields/parameters that make up the record. For example:

Program  Project  PI             IAI Scientific Theme
XXX      001      Marcelo I.     I and III
YYY      003      Carlos Soares  IV
ZZZ      004      ----           **

In the previous example, the third record has ---- as PI and ** as IAI Scientific Theme. Here ---- represents a missing value and ** represents all themes. Such codes should be declared in the metadata in order to avoid misunderstandings.
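
A reader of such a file needs to know the special codes in order to interpret them. The sketch below assumes a comma-separated rendering of the example table (the delimiter and the `all_themes` key are assumptions made for the illustration) and translates the two codes while loading the records.

```python
import csv
import io

# Special codes, as they would be declared in the metadata.
MISSING = "----"      # missing value
ALL_THEMES = "**"     # record applies to all themes

# Hypothetical comma-separated version of the example table.
raw = """Program,Project,PI,IAI Scientific Theme
XXX,001,Marcelo I.,I and III
YYY,003,Carlos Soares,IV
ZZZ,004,----,**
"""

records = []
for row in csv.DictReader(io.StringIO(raw)):
    # Turn the missing-value code into None so it is never read as a name.
    if row["PI"] == MISSING:
        row["PI"] = None
    # Flag records whose theme column holds the "all themes" code.
    row["all_themes"] = row["IAI Scientific Theme"] == ALL_THEMES
    records.append(row)

print(records[2]["PI"], records[2]["all_themes"])  # prints None True
```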

An important issue in data organization is the number of records in each file. In general, keep a set of similar measurements together in one file, and especially avoid breaking the data into many small files, so that researchers who later use the data will not have to process many small files individually. There is an upper limit to file size, though. Large files (on the order of several tens of thousands of records, or several megabytes) become unwieldy and may be too large for some applications. Such very large data files need to be broken into smaller, logical files.
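
When a file does have to be broken up, splitting on a meaningful field keeps each part logical. The following is a minimal sketch, assuming CSV input and a hypothetical `year` column; it repeats the header line in every part so that each one remains self-describing.

```python
import csv
import io
from collections import defaultdict


def split_by_field(csv_text, field):
    """Return {field value: CSV text}, with the header repeated in each part."""
    reader = csv.DictReader(io.StringIO(csv_text))
    groups = defaultdict(list)
    for row in reader:
        groups[row[field]].append(row)

    parts = {}
    for key, rows in groups.items():
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        parts[key] = out.getvalue()
    return parts


# Hypothetical measurements, used only to show the behaviour.
sample = "site,year,value\nA,2003,1.5\nB,2003,2.0\nA,2004,1.7\n"
parts = split_by_field(sample, "year")
print(sorted(parts))  # prints ['2003', '2004']
```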

If you are collecting several different types of measurements at a site, place each type of measurement in a separate file, but remember to use similar data organization, so users will easily understand the interrelationships between them.

5. Perform basic quality assurance on your data

It is suggested that basic quality assurance be performed before releasing a data set to the public, especially for raw data files. Here are some actions that the person responsible for the data could take: 1) Make sure that the information is well organized and that all the columns represent the fields/parameters you want to document; 2) Check the file organization to ensure that there are no missing values for key parameters; 3) Apply a sort criterion to order the records; 4) Scan parameters for impossible values; and 5) Review the records to catch mistakes.
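
Some of these checks can be automated. The sketch below covers actions 2 to 4 (missing key values, sorting, and impossible values); the field names, sample records and valid ranges are hypothetical and would need to be replaced by those of the actual data set.

```python
def check_records(records, key_fields, valid_ranges):
    """Return (row number, problem) tuples; an empty list means nothing was found."""
    problems = []
    for number, record in enumerate(records, start=1):
        # Action 2: key parameters must not be missing.
        for field in key_fields:
            if record.get(field) in (None, ""):
                problems.append((number, f"missing key field {field!r}"))
        # Action 4: values outside the declared range are impossible.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                problems.append((number, f"{field!r} out of range: {value}"))
    return problems


# Hypothetical records: the second one has two deliberate errors.
records = [
    {"project": "001", "year": 2004, "latitude": -3.1},
    {"project": "", "year": 2004, "latitude": -95.0},
]

# Action 3: sort the records, here by project and then by year.
records.sort(key=lambda r: (r["project"], r["year"]))

problems = check_records(
    records,
    key_fields=["project"],
    valid_ranges={"latitude": (-90, 90), "year": (1900, 2100)},
)
for number, problem in problems:
    print(number, problem)
```

Automated checks complement, but do not replace, the manual review in action 5.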

6. Assign descriptive metadata titles and data file names

It is recommended that metadata titles be as descriptive as possible. Data file names should also make it easy to identify the corresponding metadata. If the metadata belongs to a large scientific program, you may want to use that information as part of the title. If the metadata references a publication or scientific article, it is advisable to use the title of the publication as the metadata title.

In addition, the length of the title should be restricted to 80 characters (spaces included) to be compatible with other global change data collections.
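
The length restriction is easy to verify before a title is submitted. A minimal sketch, in which the function name is an assumption and surrounding whitespace is ignored:

```python
MAX_TITLE_LENGTH = 80  # limit stated in this guide, spaces included


def title_is_valid(title):
    """Accept non-empty titles of at most 80 characters."""
    stripped = title.strip()
    return 0 < len(stripped) <= MAX_TITLE_LENGTH


print(title_is_valid("The use of soils in Amazon Region"))  # prints True
print(title_is_valid("x" * 81))                             # prints False
```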

7. Provide clear documentation for each metadata

The metadata should be written for a user 10 or 20 years in the future (what does that investigator need to know to use the data?). Use all pertinent fields to make the information clear, and in particular aim the metadata at a user who is unfamiliar with your project, methods, or observations.

To ensure that data can be read in the future, it is suggested that ASCII format be used for text and that GIF, JPG or TIFF be used for figures, maps or pictures. Proprietary formats such as RTF or PDF are also suitable.

In addition, all metadata created within the DIS should answer the following questions for the user: 1) What is the scientific reason for this data collection? 2) What instruments were used to obtain the data set? 3) Who collected the data, and who should be contacted with questions? 4) What type of resource is referenced? 5) What is the IAI scientific program? 6) What are the scientific keywords? 7) What software was used to prepare the data set? 8) What special codes are used, including those for missing values? 9) When were the data last modified? and 10) Where can the data set be retrieved?
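
The ten questions can serve as a completeness checklist before a metadata record is published. In the sketch below, the dictionary keys are hypothetical labels chosen for the illustration (one per question, in the same order); they are not actual DIS field names.

```python
# One hypothetical key per question, in the order of the list above.
REQUIRED_ANSWERS = [
    "purpose", "instruments", "contact", "resource_type", "program",
    "keywords", "software", "special_codes", "last_modified", "retrieval_location",
]


def unanswered(metadata):
    """Return the checklist keys that are absent or blank in the record."""
    missing = []
    for key in REQUIRED_ANSWERS:
        value = metadata.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Hypothetical draft record that answers only three of the questions.
draft = {"purpose": "Soil survey", "contact": "Jose Maria Braga", "last_modified": "20040115"}
print(unanswered(draft))
```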

8. Bibliography

Porter, J.H. 1997. Data and Information Submission at the Virginia Coast LTER. Available on-line at: http://www.vcrlter.virginia.edu/data/submission.html

Cook, R.B., Olson, R.J., Kanciruk, P., and Hook, L.A. 2000. Best Practices for Preparing Ecological and Ground-Based Data Sets to Share and Archive. Available on-line at: https://daac.ornl.gov/cgi-bin/MDE/S2K/bestprac.html

Topic revision: r3 - 2006-04-17 - LuisMarceloAchite
 