WNDB(5WN) WordNet™ File Formats WNDB(5WN)
NAME
index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, in-
dex.adv, data.adv - WordNet database files
noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists
sentidx.vrb, sents.vrb - files used by search code to display sentences
illustrating the use of some specific verbs
DESCRIPTION
For each syntactic category, two files are needed to represent the con-
tents of the WordNet database - index.pos and data.pos, where pos is
one of noun, verb, adj or adv. The other auxiliary files are used by the
WordNet library's searching functions and are needed to run the various
WordNet browsers.
Each index file is an alphabetized list of all the words found in Word-
Net in the corresponding part of speech. On each line, following the
word, is a list of byte offsets (synset_offsets) in the corresponding
data file, one for each synset containing the word. Words in the index
file are in lower case only, regardless of how they were entered in the
lexicographer files. This folds various orthographic representations
of the word into one line, enabling database searches to be case insen-
sitive. See wninput(5WN) for a detailed description of the lexicogra-
pher files.
A data file for a syntactic category contains information corresponding
to the synsets that were specified in the lexicographer files, with re-
lational pointers resolved to synset_offsets. Each line corresponds to
a synset. Pointers are followed and hierarchies traversed by moving
from one synset to another via the synset_offsets.
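As a rough illustration of following a synset_offset, the sketch below seeks to a byte offset and reads one line. The function name read_synset_line and the two-line in-memory file are made up for the example; a real data.pos file would be opened in binary mode in the same way.

```python
import io

def read_synset_line(data_file, synset_offset):
    """Return the line that starts at byte synset_offset in a data file."""
    data_file.seek(synset_offset)  # offsets count bytes from the start of the file
    return data_file.readline().decode("ascii").rstrip("\n")

# Made-up two-line stand-in for a data.pos file, held in a binary buffer so
# that seek() positions are byte offsets.
first = b"00000000 03 n 01 alpha 0 000 | made-up gloss\n"
second_offset = len(first)
second = (
    "%08d 03 n 01 beta 0 001 @ 00000000 n 0000 | made-up gloss\n" % second_offset
).encode("ascii")
fake_data = io.BytesIO(first + second)

line = read_synset_line(fake_data, second_offset)
# In this toy line the only pointer's synset_offset is the ninth field.
target_offset = int(line.split()[8])
target_line = read_synset_line(fake_data, target_offset)
```

Here the second synset carries one pointer whose target offset leads back to the first line, which is the kind of hop a hierarchy traversal repeats.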
The exception list files, pos.exc, are used to help the morphological
processor find base forms from irregular inflections.
The files sentidx.vrb and sents.vrb contain sentences illustrating the
use of specific senses of some verbs. These files are used by the
searching software in response to a request for verb sentence frames.
Generic sentence frames are displayed when an illustrative sentence is
not present.
The various database files are in ASCII formats that are easily read by
both humans and machines. All fields, unless otherwise noted, are sep-
arated by one space character, and all lines are terminated by a new-
line character. Fields enclosed in square brackets are optional and may
not be present.
See wngloss(7WN) for a glossary of WordNet terminology and a discussion
of the database's content and logical organization.
Index File Format
Each index file begins with several lines containing a copyright no-
tice, version number and license agreement. These lines all begin with
two spaces and the line number so they do not interfere with the binary
search algorithm that is used to look up entries in the index files.
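The following is a minimal sketch of such a lookup, done in memory with Python's bisect module rather than with the WordNet library's own search code. It assumes the lines are sorted in plain byte order, in which case header lines starting with a space sort before every entry. The function name lookup and the sample lines are invented for illustration.

```python
from bisect import bisect_left

def lookup(lines, lemma):
    """Binary-search sorted index lines for an entry whose first field is lemma."""
    key = lemma.lower() + " "   # the field separator guards against prefix matches
    i = bisect_left(lines, key)
    if i < len(lines) and lines[i].startswith(key):
        return lines[i]
    return None

# Made-up index content: two header lines (two spaces, then a line number)
# followed by three entries.
index_lines = [
    "  1 made-up license text",
    "  2 more made-up license text",
    "apple n 1 0 1 0 00000100",
    "apple_tree n 1 0 1 0 00000200",
    "pear n 1 0 1 0 00000300",
]

hit = lookup(index_lines, "Apple")
miss = lookup(index_lines, "banana")
```

Because the header lines sort ahead of all entries in this ordering, the search never needs to special-case them.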
All other lines are in the following format. In the field descrip-
tions, number always refers to a decimal integer unless otherwise de-
fined.
lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]
lemma lower case ASCII text of word or collocation. Colloca-
tions are formed by joining individual words with an un-
derscore (_) character.
pos Syntactic category: n for noun files, v for verb files,
a for adjective files, r for adverb files.
All remaining fields are with respect to senses of lemma in pos.
synset_cnt Number of synsets that lemma is in. This is the number
of senses of the word in WordNet. See Sense Numbers be-
low for a discussion of how sense numbers are assigned
and the order of synset_offsets in the index files.
p_cnt Number of different pointers that lemma has in all
synsets containing it.
ptr_symbol A space separated list of p_cnt different types of
pointers that lemma has in all synsets containing it.
See wninput(5WN) for a list of pointer_symbols. If all
senses of lemma have no pointers, this field is omitted
and p_cnt is 0.
sense_cnt Same as synset_cnt above. This is redundant, but the
field was preserved for compatibility reasons.
tagsense_cnt Number of senses of lemma that are ranked according to
their frequency of occurrence in semantic concordance
texts.
synset_offset Byte offset in data.pos file of a synset containing
lemma. Each synset_offset in the list corresponds to a
different sense of lemma in WordNet. synset_offset is
an 8 digit, zero-filled decimal integer that can be used
with fseek(3) to read a synset from the data file. When
passed to read_synset(3WN) along with the syntactic cat-
egory, a data structure containing the parsed synset is
returned.
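The field layout above can be sketched as a small parser. This is an illustration only, not the WordNet library's code: IndexEntry and parse_index_line are hypothetical names, and the sample line is made up.

```python
from typing import List, NamedTuple

class IndexEntry(NamedTuple):
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[int]

def parse_index_line(line):
    """Split one non-header index line into its fields."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    ptr_symbols = fields[4:4 + p_cnt]   # empty when p_cnt is 0
    rest = fields[4 + p_cnt:]
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    synset_offsets = [int(f) for f in rest[2:]]   # 8 digit, zero-filled decimals
    if len(synset_offsets) != synset_cnt:
        raise ValueError("offset count does not match synset_cnt")
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)

# Made-up entry: two senses, two pointer types, one tagged sense.
entry = parse_index_line("dog n 2 2 @ ~ 2 1 00001000 00002000")
```

The offsets are converted to integers so they can be passed straight to a seek call; formatting them back with eight zero-filled digits recovers the on-disk form.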
Data File Format
Each data file begins with several lines containing a copyright notice,
version number and license agreement. These lines all begin with two
spaces and the line number. All other lines are in the following for-
mat. Integer fields are of fixed length, and are zero-filled.
synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
synset_offset Current byte offset in the file represented as an 8
digit decimal integer.
lex_filenum Two digit decimal integer corresponding to the lexicog-
rapher file name containing the synset. See lex-
names(5WN) for the list of filenames and their corre-
sponding numbers.
ss_type One character code indicating the synset type:
n NOUN
v VERB
a ADJECTIVE
s ADJECTIVE SATELLITE
r ADVERB
w_cnt Two digit hexadecimal integer indicating the number of
words in the synset.
word ASCII form of a word as entered in the synset by the
lexicographer, with spaces replaced by underscore char-
acters (_). The text of the word is case sensitive, in
contrast to its form in the corresponding index.pos
file, that contains only lower-case forms. In data.adj,
a word is followed by a syntactic marker if one was
specified in the lexicographer file. A syntactic marker
is appended, in parentheses, onto word without any in-
tervening spaces. See wninput(5WN) for a list of the
syntactic markers for adjectives.
lex_id One digit hexadecimal integer that, when appended onto
lemma, uniquely identifies a sense within a lexicogra-
pher file. lex_id numbers usually start with 0, and are
incremented as additional senses of the word are added
to the same file, although there is no requirement that
the numbers be consecutive or begin with 0. Note that a
value of 0 is the default, and therefore is not present
in lexicographer files.
p_cnt Three digit decimal integer indicating the number of
pointers from this synset to other synsets. If p_cnt is
000 the synset has no pointers.
ptr A pointer from this synset to another. ptr is of the
form:
pointer_symbol synset_offset pos source/target
where synset_offset is the byte offset of the target
synset in the data file corresponding to pos.
The source/target field distinguishes lexical and seman-
tic pointers. It is a four byte field, containing two
two-digit hexadecimal integers. The first two digits
indicate the word number in the current (source)
synset, the last two digits indicate the word number in
the target synset. A value of 0000 means that
pointer_symbol represents a semantic relation between
the current (source) synset and the target synset indi-
cated by synset_offset.
A lexical relation between two words in different
synsets is represented by non-zero values in the source
and target word numbers. The first and last two bytes
of this field indicate the word numbers in the source
and target synsets, respectively, between which the re-
lation holds. Word numbers are assigned to the word
fields in a synset, from left to right, beginning with
1.
See wninput(5WN) for a list of pointer_symbols, and se-
mantic and lexical pointer classifications.
frames In data.verb only, a list of numbers corresponding to
the generic verb sentence frames for words in the
synset. frames is of the form:
f_cnt + f_num w_num [ + f_num w_num...]
where f_cnt is a two digit decimal integer indicating the
number of generic frames listed, f_num is a two digit
decimal integer frame number, and w_num is a two digit
hexadecimal integer indicating the word in the synset
that the frame applies to. As with pointers, if this
number is 00, f_num applies to all words in the synset.
If non-zero, it is applicable only to the word indi-
cated. Word numbers are assigned as described for
pointers. Each f_num w_num pair is preceded by a +.
See wninput(5WN) for the text of the generic sentence
frames.
gloss Each synset contains a gloss. A gloss is represented as
a vertical bar (|), followed by a text string that con-
tinues until the end of the line. The gloss may contain
a definition, one or more example sentences, or both.
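A data line can be decoded field by field in the order given above. The sketch below is one possible reading of the format, with the hexadecimal fields (w_cnt, lex_id, source/target, w_num) parsed in base 16. The name parse_data_line, the dictionary keys and the sample line are all invented for illustration, and adjective syntactic markers are left attached to the word.

```python
def parse_data_line(line):
    """Decode one non-header data line into a dictionary."""
    body, _, gloss = line.partition("|")   # the gloss runs to the end of the line
    f = body.split()
    synset = {
        "offset": int(f[0]),
        "lex_filenum": int(f[1]),
        "ss_type": f[2],
        "words": [],
        "pointers": [],
        "frames": [],
        "gloss": gloss.strip(),
    }
    w_cnt = int(f[3], 16)                  # two digit hexadecimal
    i = 4
    for _ in range(w_cnt):
        synset["words"].append((f[i], int(f[i + 1], 16)))   # (word, lex_id)
        i += 2
    p_cnt = int(f[i])                      # three digit decimal
    i += 1
    for _ in range(p_cnt):
        symbol, target, pos, src_tgt = f[i:i + 4]
        synset["pointers"].append({
            "symbol": symbol,
            "target_offset": int(target),
            "pos": pos,
            "source_word": int(src_tgt[:2], 16),   # 0 means semantic pointer
            "target_word": int(src_tgt[2:], 16),
        })
        i += 4
    if i < len(f):                         # frames: data.verb only
        f_cnt = int(f[i])
        i += 1
        for _ in range(f_cnt):
            # f[i] is the literal "+" that precedes each f_num w_num pair
            synset["frames"].append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3
    return synset

# Made-up verb synset with one semantic pointer, one lexical pointer, two frames.
sample = ('00001234 29 v 02 run 0 sprint 1 002 @ 00000500 v 0000 '
          '+ 00000900 n 0201 02 + 01 00 + 02 01 | move fast; "made-up example"')
synset = parse_data_line(sample)
```

In the sample, the second pointer's source/target value 0201 is read as word 2 of this synset related to word 1 of the target synset, and the frame pair 02 01 applies frame 2 to word 1 only.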
Sense Numbers
Senses in WordNet are generally ordered from most to least frequently
used, with the most common sense numbered 1. Frequency of use is de-
termined by the number of times a sense is tagged in the various seman-
tic concordance texts. Senses that are not semantically tagged follow
the ordered senses. The tagsense_cnt field for each entry in the in-
dex.pos files indicates how many of the senses in the list have been
tagged.
The cntlist(5WN) file provided with the database lists the number of
times each sense is tagged in the semantic concordances. The data from
cntlist is used by grind(1WN) to order the senses of each word. When
the index.pos files are generated, the synset_offsets are output in
sense number order, with sense 1 first in the list. Senses with the
same number of semantic tags are assigned unique but consecutive sense
numbers. The WordNet OVERVIEW search displays all senses of the speci-
fied word, in all syntactic categories, and indicates which of the
senses are represented in the semantically tagged texts.
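Since the synset_offsets are listed in sense number order and untagged senses follow the tagged ones, sense numbers can be recovered from an index entry by position. The sketch below assumes that reading of the text, namely that the first tagsense_cnt offsets are the tagged senses. The function name number_senses and the sample values are made up.

```python
def number_senses(synset_offsets, tagsense_cnt):
    """Return (sense_number, synset_offset, is_tagged) for each listed sense."""
    return [
        (number, offset, number <= tagsense_cnt)
        for number, offset in enumerate(synset_offsets, start=1)
    ]

# Made-up entry with three senses, of which the first two are tagged.
senses = number_senses([3000, 1000, 2000], tagsense_cnt=2)
```

Note that the offsets themselves need not be in ascending order; only their position in the list determines the sense number.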
Exception List File Format
Exception lists are alphabetized lists of inflected forms of words and
their base forms. The first field of each line is an inflected form,
followed by a space separated list of one or more base forms of the
word. There is one exception list file for each syntactic category.
Note that the noun and verb exception lists were automatically gener-
ated from a machine-readable dictionary, and contain many words that
are not in WordNet. Also, for many of the inflected forms, base forms
could be easily derived using the standard rules of detachment pro-
grammed into Morphy (see morphy(7WN)). These anomalies are allowed to
remain in the exception list files, as they do no harm.
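Loading an exception list amounts to mapping the first field of each line to the remaining fields. The sketch below shows one way to do that; parse_exception_lines is a hypothetical name, and the sample entries are illustrative rather than copied from the distributed files.

```python
def parse_exception_lines(lines):
    """Map each inflected form to its list of one or more base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue                     # skip blank lines
        table[fields[0]] = fields[1:]
    return table

# Illustrative entries in alphabetical order.
sample_lines = [
    "axes axis ax",
    "geese goose",
    "mice mouse",
]
exceptions = parse_exception_lines(sample_lines)
bases = exceptions.get("axes", [])       # an unknown form yields an empty list
```

A caller would consult this table first and fall back to the regular rules of detachment when the form is absent.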
Verb Example Sentences
For some verb senses, example sentences illustrating the use of the
verb sense can be displayed. Each line of the file sentidx.vrb con-
tains a sense_key followed by a space and a comma separated list of ex-
ample sentence template numbers, in decimal. The file sents.vrb lists
all of the example sentence templates. Each line begins with the tem-
plate number followed by a space. The rest of the line is the text of
a template example sentence, with %s used as a placeholder in the text
for the verb. Both files are sorted alphabetically so that the
sense_key and template sentence number can be used as indices, via bin-
srch(3WN), into the appropriate file.
When a request for FRAMES is made, the WordNet search code looks for
the sense in sentidx.vrb. If it is found, the listed sentence templates
are retrieved from sents.vrb, and the %s is replaced with the verb. If the
sense is not found, the applicable generic sentence frames listed in
frames are displayed.
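The lookup just described can be sketched with two small parsers and a substitution step. This uses in-memory dictionaries instead of binsrch(3WN), and every name, sense key and template below is a made-up placeholder; see senseidx(5WN) for how real sense_keys are encoded.

```python
def parse_sentidx_line(line):
    """Return (sense_key, [template numbers]) from a sentidx.vrb line."""
    sense_key, _, numbers = line.strip().partition(" ")
    return sense_key, [int(n) for n in numbers.split(",") if n]

def parse_sents_line(line):
    """Return (template number, template text) from a sents.vrb line."""
    number, _, template = line.rstrip("\n").partition(" ")
    return int(number), template

def example_sentences(sense_key, verb, sentidx, templates):
    """Fill the %s placeholder of every template listed for sense_key."""
    return [templates[n].replace("%s", verb) for n in sentidx.get(sense_key, [])]

# Placeholder content, not taken from the distributed files.
sentidx = dict([parse_sentidx_line("run%2:38:00:: 7,12")])
templates = dict(parse_sents_line(l) for l in [
    "7 They %s every morning",
    "12 Watch them %s",
])
sentences = example_sentences("run%2:38:00::", "run", sentidx, templates)
```

An empty result signals that the sense has no example sentences, at which point a caller would show the generic frames instead.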
NOTES
Information in the data.pos and index.pos files represents all of the
word senses and synsets in the WordNet database. The word, lex_id, and
lex_filenum fields together uniquely identify each word sense in Word-
Net. These can be encoded in a sense_key as described in sen-
seidx(5WN). Each synset in the database can be uniquely identified by
combining the synset_offset for the synset with a code for the syntac-
tic category (since it is possible for synsets in different data.pos
files to have the same synset_offset).
The WordNet system provides both command line and window-based browser
interfaces to the database. Both interfaces utilize a common library
of search and morphology code. The source code for the library and in-
terfaces is included in the WordNet package. See wnintro(3WN) for an
overview of the WordNet source code.
ENVIRONMENT VARIABLES (UNIX)
WNHOME Base directory for WordNet. Default is
/usr/local/WordNet-3.0.
WNSEARCHDIR Directory in which the WordNet database has been
installed. Default is WNHOME/dict.
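The defaults above suggest a simple resolution order for the database directory: WNSEARCHDIR if set, otherwise the dict subdirectory of WNHOME, otherwise the dict subdirectory of the default base directory. The sketch below follows that reading; the function name wordnet_dict_dir is hypothetical and this is not the WordNet library's own logic.

```python
import os
import posixpath

DEFAULT_WNHOME = "/usr/local/WordNet-3.0"

def wordnet_dict_dir(environ=None):
    """Resolve the database directory from WNSEARCHDIR and WNHOME (UNIX)."""
    env = os.environ if environ is None else environ
    search_dir = env.get("WNSEARCHDIR")
    if search_dir:
        return search_dir
    home = env.get("WNHOME") or DEFAULT_WNHOME
    return posixpath.join(home, "dict")

# Passing a plain dict keeps the example independent of the real environment.
default_dir = wordnet_dict_dir({})
custom_dir = wordnet_dict_dir({"WNHOME": "/opt/wn"})
```

Calling wordnet_dict_dir() with no argument reads the process environment.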
REGISTRY (WINDOWS)
HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
Base directory for WordNet. Default is
C:\Program Files\WordNet\3.0.
FILES
index.pos database index files
data.pos database data files
*.vrb files of sentences illustrating the use of verbs
pos.exc morphology exception lists
SEE ALSO
grind(1WN), wn(1WN), wnb(1WN), wnintro(3WN), binsrch(3WN), wnin-
tro(5WN), cntlist(5WN), lexnames(5WN), senseidx(5WN), wninput(5WN),
morphy(7WN), wngloss(7WN), wngroups(7WN), wnstats(7WN).
WordNet 3.0 Dec 2006 WNDB(5WN)