What are the best practices for creating metadata?
The objective of this best practices guide is to help IAI
investigators during the metadata creation process, in order to
improve the usability and visibility of their data.
1. Assign descriptive file names to each metadata file created
Metadata file names should reflect the contents of the data and
include enough information to uniquely identify the file. As a
suggestion, file names should contain information such as the program
acronym, project number, investigator name or number, type of data,
or any other pertinent information.
Clear, descriptive and unique file names become important later, when
a metadata file is placed in a directory or on an FTP site alongside
other files. Avoid file names such as mypersonaldata1.xml or
data1.xml. An example of a good file name is
IAI-ScientificArticle_Prog_001_Studies_999.xml, where IAI is the name
of the institute, ScientificArticle represents a publication, Prog is
the program acronym, 001 is the project number, Studies is a word or
expression that represents the data, 999 is a reference number for
that specific publication and .xml is the format of the metadata
file.
When naming a metadata file, avoid special characters that may be
handled differently on different computer platforms. You may want to
use similar logic when naming directories and data files, or when
writing a metadata title.
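The naming pattern above can be sketched in Python. This is only an illustration: the function name, its parameters and the set of "safe" characters are assumptions, not rules defined by the IAI or the DIS.

```python
import re

# Characters assumed to be safe on common platforms (letters, digits,
# '-', '_' and '.'). This whitelist is an illustrative assumption.
SAFE_NAME = re.compile(r"^[A-Za-z0-9._-]+$")


def build_metadata_filename(institute, data_type, program, project,
                            subject, reference, extension="xml"):
    """Assemble a file name that follows the pattern of the example above."""
    name = (f"{institute}-{data_type}_{program}_{int(project):03d}"
            f"_{subject}_{int(reference):03d}.{extension}")
    # Reject spaces, accents and other characters that may not be portable.
    if not SAFE_NAME.match(name):
        raise ValueError(f"file name contains special characters: {name!r}")
    return name


print(build_metadata_filename("IAI", "ScientificArticle", "Prog", 1,
                              "Studies", 999))
# IAI-ScientificArticle_Prog_001_Studies_999.xml
```

Building names from their parts, rather than typing them by hand, keeps the project and reference numbers zero-padded and consistent across files.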
2. Use consistent and stable file formats when submitting data
ASCII format is the best way to ensure that data remain readable in
the future and to avoid portability problems across platforms.
However, it is not the only acceptable format.
Recommended formats are listed below for each of the following three
data types: Textual or Graphical data; Numerical, Coded or Raw data;
and Binary or any other data structure.
Textual or Graphical data
Typical examples of textual data are publications (theses, papers or
abstracts) that are largely self-documenting. The following formats
are good for this kind of document: PDF, HTML or RTF. Examples of
graphical data are posters, images, photos or presentations. The
formats GIF, JPG, TIFF or EPS/PDF are good options for this type.
Numerical, Coded or Raw data
Data are distributed in a consistent form (one record per line) with
individual values separated by commas or any other delimiter
character. A CSV (Comma Separated Values) file is a good example of
this type of data. When an item is a string, it is important to
enclose it in quote characters.
For example:
Program, Project Number, Title, PI, Year of Conclusion
PROG, 001, The use of soils in Amazon Region, Jose Maria Braga, 2004
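As a minimal sketch of the quoting advice, the record from the example can be written with Python's standard csv module. Treating the project number as a string (so that its leading zeros survive) is an assumption made here for illustration.

```python
import csv
import io

header = ["Program", "Project Number", "Title", "PI", "Year of Conclusion"]
# The project number is kept as a string so "001" does not become 1.
record = ["PROG", "001", "The use of soils in Amazon Region",
          "Jose Maria Braga", 2004]

buffer = io.StringIO()
# QUOTE_NONNUMERIC quotes every string field and leaves numbers bare.
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerow(header)
writer.writerow(record)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: quoted fields return as str, unquoted ones as float.
rows = list(csv.reader(io.StringIO(csv_text), quoting=csv.QUOTE_NONNUMERIC))
print(rows[1])
```

Quoting the strings also protects titles or names that happen to contain a comma.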
Binary data or any other data structure
This type covers data submitted in a proprietary binary format that
can be read only by the software that produced it. All information
about the binary format should be well documented in the metadata, so
that future users will be able to open the data in order to visualize
and analyse them. Although this is a valid format, it is better to
avoid it and use a more universal format, like PDF, if the situation
allows.
3. Document data by using specific metadata fields
In order for others to identify a data set through the DIS Search
Process, the metadata should have enough fields filled in to document
it well, so that users can easily identify the data set most
appropriate for their research. The DIS has four groups of fields:
1) Identification Information, where items like type of data, title,
abstract and scientific keywords are registered; 2) Data Quality
Information, where information about completeness is entered;
3) Distribution Information, to document online linkages (file
formats and addresses); and 4) Metadata Reference Information, to
document who created the metadata, access information, etc.
Below is a group of suggestions for some of these fields:
Dates: yyyymmdd, e.g., January 2, 1997 is 19970102.
Spatial Coordinates: Spatial coordinates should be recorded in
decimal degrees, to at least 5 digits past the decimal point. Provide
latitude and longitude, with south latitude and west longitude
recorded as negative values (e.g. -50.50000 for 50° 30' 00" W).
When "Show pick list" appears in front of a field, select this option
to see the available possibilities and use one of them, which makes
the search process easier and more standardized. Although the list is
not exclusive, it is suggested that only the listed options be used.
"Add another" after a field means that another group of information
can be inserted (e.g. for Parameter Description, researchers might
want to insert more than one keyword in order to make a metadata
record more complete).
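The date and coordinate conventions above can be sketched as two small Python helpers. The function names are hypothetical, and the output is fixed at exactly 5 decimal places as one way to meet the "at least 5 digits" suggestion.

```python
from datetime import date


def format_date(d):
    """Return a date in yyyymmdd form."""
    return d.strftime("%Y%m%d")


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South latitudes and west longitudes become negative. The result is
    returned as a string with 5 digits past the decimal point.
    """
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere.upper() in ("S", "W"):
        value = -value
    return f"{value:.5f}"


print(format_date(date(1997, 1, 2)))    # 19970102
print(dms_to_decimal(50, 30, 0, "W"))   # -50.50000
```

Thirty minutes is half a degree, so 50° 30' 00" W becomes -50.50000.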
4. Use consistent data organization
When disseminating data in a raw format, it is suggested that the
information be organized in a single file. Whichever style you use,
be sure to place the names of the fields/parameters on the first line
of the file, and then insert the actual records on the subsequent
lines. In practice, each row in a file most often represents a
complete record and the columns represent the fields/parameters that
make up the record. For example:
Program  Project  PI             IAI Scientific Theme
XXX      001      Marcelo I.     I and III
YYY      003      Carlos Soares  IV
ZZZ      004      ----           **
In the previous example, the third record has ---- as PI and ** as
IAI Scientific Theme. Here ---- represents a missing value and **
represents all themes. This information should be declared somewhere
in the metadata in order to avoid misunderstandings.
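A reader of such a file needs the declared codes in order to interpret it correctly. The sketch below, in Python, parses the example records and handles both codes explicitly. The tab delimiter and the added "all_themes" flag are assumptions made for illustration.

```python
import csv
import io

# Special codes from the example above, as declared in the metadata.
MISSING = "----"
ALL_THEMES = "**"

# Tab-separated copy of the example table (the delimiter is an assumption).
raw = (
    "Program\tProject\tPI\tIAI Scientific Theme\n"
    "XXX\t001\tMarcelo I.\tI and III\n"
    "YYY\t003\tCarlos Soares\tIV\n"
    "ZZZ\t004\t----\t**\n"
)

records = []
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    # Replace the missing-value code with None so it is never read as a name.
    row["PI"] = None if row["PI"] == MISSING else row["PI"]
    # Flag records that apply to every theme instead of keeping the raw code.
    row["all_themes"] = row["IAI Scientific Theme"] == ALL_THEMES
    records.append(row)

print(records[2])
```

Without the declaration in the metadata, a program like this would have no way to tell that ---- is not an investigator's name.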
An important issue with data organization is the number of records in
each file. In general, keep a set of similar measurements together in
one file and, especially, do not break up the data into many small
files, so that researchers who later use the data will not have to
process many small files individually. There is an upper limit to the
size of files, though. Large files (on the order of several tens of
thousands of records, or several megabytes) become unwieldy and may
be too large for some applications. Such very large data files need
to be broken into smaller, logical files.
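One possible way to break a very large file into logical pieces is to group the records by a meaningful field, such as the year. The Python sketch below assumes this approach; the function name, field names and records are hypothetical.

```python
from collections import defaultdict


def split_by_field(header, rows, field):
    """Group rows into one table per distinct value of `field`.

    Every group keeps the same header, so the smaller files share one
    data organization.
    """
    index = header.index(field)
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    return {key: [header] + group for key, group in groups.items()}


# Hypothetical records: site, year, measured value.
header = ["Site", "Year", "Value"]
rows = [["A", "2003", "1.2"], ["A", "2004", "1.5"], ["B", "2003", "0.9"]]
tables = split_by_field(header, rows, "Year")
print(sorted(tables))   # ['2003', '2004']
```

Splitting on a field with real meaning, rather than every N lines, keeps each resulting file usable on its own.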
If you are
collecting several different types of measurements at a site, place
each type of measurement in a separate file, but remember to use
similar data organization, so users will easily understand the
interrelationships between them.
5. Perform basic quality assurance on your data
It is suggested that a minimum of basic data quality assurance be
performed before releasing a data set to the public, especially for
raw data files. Here are some actions that the person responsible
could take: 1) Make sure that the information is well organized and
that all the columns represent the fields/parameters you want to
document; 2) Check file organization in order to ensure that there
are no missing values for key parameters; 3) Apply some kind of sort
criterion to list the records; 4) Scan parameters for impossible
values; and 5) Review records in order to avoid mistakes.
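Some of these checks can be automated. The following Python sketch covers missing key values, impossible values and an explicit sort order. The field names, bounds and sample rows are hypothetical, and the check does not replace a manual review of the records.

```python
def check_records(records, key_fields, valid_ranges):
    """Return a list of (row_number, message) problems found in `records`.

    records      -- list of dicts, one per data row
    key_fields   -- fields that must not be empty
    valid_ranges -- {field: (low, high)} bounds for numeric fields
    """
    problems = []
    for number, record in enumerate(records, start=1):
        for field in key_fields:
            if record.get(field) in (None, ""):
                problems.append((number, f"missing value for {field}"))
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value in (None, ""):
                continue  # emptiness is reported by the key-field check
            if not low <= float(value) <= high:
                problems.append(
                    (number, f"impossible value for {field}: {value}"))
    return problems


# Hypothetical rows; field names and bounds are illustrative only.
records = [
    {"Site": "A", "Latitude": "-23.5", "Year": "2004"},
    {"Site": "", "Latitude": "-123.0", "Year": "2003"},
]
records.sort(key=lambda r: (r["Year"], r["Site"]))  # explicit sort criterion
problems = check_records(records, ["Site"], {"Latitude": (-90, 90)})
for problem in problems:
    print(problem)
```

Here the 2003 row sorts first and is reported twice: once for the empty site and once for a latitude outside the range -90 to 90.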
6. Assign descriptive metadata titles and data file names
It is recommended that metadata titles be as descriptive as possible.
The data file names should also make it easy to find the
corresponding metadata. If a metadata record belongs to a large
scientific program, you may want to use that information as part of
the title. If the metadata reference a publication or scientific
article, it is advisable that the title of the publication be used as
the metadata title.
In addition, the length of the title should be restricted to 80
characters (spaces included) to be compatible with other global
change data collections.
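The length limit is easy to verify before a title is submitted. In this minimal Python sketch the function name is hypothetical, and rejecting empty titles is an added assumption.

```python
MAX_TITLE_LENGTH = 80  # characters, spaces included


def title_is_valid(title):
    """Return True when the title is non-empty and fits the 80-character limit."""
    return 0 < len(title.strip()) <= MAX_TITLE_LENGTH


print(title_is_valid("The use of soils in Amazon Region"))  # True
print(title_is_valid("x" * 81))                             # False
```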
7. Provide clear documentation for each metadata record
The metadata should be written for a user 10 or 20 years in the
future: what does that investigator need to know to use the data? Use
all pertinent fields in order to make the information clear and,
especially, write the metadata for a user who is unfamiliar with your
project, methods, or observations.
To ensure that data can be read in the future, it is suggested that
the ASCII format be used for text and that GIF, JPG or TIFF be used
for figures, maps or pictures. Proprietary formats such as RTF or PDF
are also suitable.
In addition, all metadata created within the DIS should answer the
following questions for the user: 1) What is the scientific reason
for this data collection? 2) What instruments were used to obtain the
data set? 3) Who collected the data and who should be contacted with
questions? 4) What type is the referenced resource? 5) What is the
IAI scientific program? 6) What are the scientific keywords? 7) What
software was used to prepare the data set? 8) What special codes are
used, including those for missing values? 9) When were the data last
modified? and 10) Where can the data set be retrieved?
8. Bibliography
Porter, J.H. 1997. Data and Information Submission at the Virginia
Coast LTER. Available online at:
http://www.vcrlter.virginia.edu/data/submission.html
Cook, R.B., Olson, R.J., Kanciruk, P., and Hook, L.A. 2000. Best
Practices for Preparing Ecological and Ground-Based Data Sets to
Share and Archive. Available online at:
https://daac.ornl.gov/cgi-bin/MDE/S2K/bestprac.html