WNDB(5WN)                   WordNet™ File Formats                    WNDB(5WN)

NAME
       index.noun, data.noun, index.verb, data.verb, index.adj, data.adj,
       index.adv, data.adv - WordNet database files

       noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists

       sentidx.vrb, sents.vrb - files used by search code to display sentences
       illustrating the use of some specific verbs

DESCRIPTION
       For each syntactic category, two files are needed to represent the con-
       tents of the WordNet database - index.pos and data.pos,  where  pos  is
       one of noun, verb, adj, or adv.  The other auxiliary files are used by the
       WordNet library's searching functions and are needed to run the various
       WordNet browsers.

       Each index file is an alphabetized list of all the words found in Word-
       Net in the corresponding part of speech.  On each line,  following  the
       word,  is  a list of byte offsets (synset_offsets) in the corresponding
       data file, one for each synset containing the word.  Words in the index
       file are in lower case only, regardless of how they were entered in the
       lexicographer files.  This folds various  orthographic  representations
       of the word into one line, enabling database searches to be case
       insensitive.  See wninput(5WN) for a detailed description of the
       lexicographer files.

       A data file for a syntactic category contains information corresponding
       to the synsets that were specified in the lexicographer files, with re-
       lational pointers resolved to synset_offsets.  Each line corresponds to
       a synset.  Pointers are followed and hierarchies  traversed  by  moving
       from one synset to another via the synset_offsets.
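       As a rough illustration of following a synset_offset, the Python sketch
       below seeks to a byte position and reads one line.  The helper name
       read_synset_line and the in-memory sample data are hypothetical and are
       not part of the WordNet library; a real data.pos file would be opened
       in binary mode in the same way.

```python
import io

def read_synset_line(data_file, synset_offset):
    """Return the line that starts at byte position synset_offset.

    data_file must be opened in binary mode so that seek() counts bytes.
    """
    data_file.seek(synset_offset)
    return data_file.readline().decode("ascii").rstrip("\n")

# In-memory stand-in for a data.pos file; both lines are made up.
# The first line is 27 bytes long including its newline, so the
# second line starts at byte offset 27.
fake_data = io.BytesIO(
    b"00000000 first synset line\n"
    b"00000027 second synset line\n"
)

line = read_synset_line(fake_data, 27)
print(line)
```

       Because every pointer stores the byte offset of its target, repeating
       this seek-and-read step is enough to walk from one synset to the next.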

       The  exception  list files, pos.exc, are used to help the morphological
       processor find base forms from irregular inflections.

       The files sentidx.vrb and sents.vrb contain sentences illustrating  the
       use  of  specific  senses  of  some verbs.  These files are used by the
       searching software in response to a request for verb  sentence  frames.
       Generic  sentence frames are displayed when an illustrative sentence is
       not present.

       The various database files are in ASCII formats that are easily read by
       both humans and machines.  All fields, unless otherwise noted, are
       separated by one space character, and all lines are terminated by a
       newline character.  Fields enclosed in square brackets are optional
       and may not be present.

       See wngloss(7WN) for a glossary of WordNet terminology and a discussion
       of the database's content and logical organization.

   Index File Format
       Each  index  file  begins with several lines containing a copyright no-
       tice, version number and license agreement.  These lines all begin with
       two spaces and the line number so they do not interfere with the binary
       search algorithm that is used to look up entries in  the  index  files.
       All  other  lines  are  in the following format.  In the field descrip-
       tions, number always refers to a decimal integer unless  otherwise  de-
       fined.

       lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt   synset_offset  [synset_offset...]

       lemma          lower  case ASCII text of word or collocation.  Colloca-
                      tions are formed by joining individual words with an un-
                      derscore (_) character.

       pos            Syntactic  category: n for noun files, v for verb files,
                      a for adjective files, r for adverb files.

       All remaining fields are with respect to senses of lemma in pos.

       synset_cnt     Number of synsets that lemma is in.  This is the  number
                      of  senses of the word in WordNet. See Sense Numbers be-
                      low for a discussion of how sense numbers  are  assigned
                      and the order of synset_offsets in the index files.

       p_cnt          Number  of  different  pointers  that  lemma  has in all
                      synsets containing it.

       ptr_symbol     A space separated  list  of  p_cnt  different  types  of
                      pointers  that  lemma  has in all synsets containing it.
                      See wninput(5WN) for a list of pointer_symbols.  If  all
                      senses  of lemma have no pointers, this field is omitted
                      and p_cnt is 0.

       sense_cnt      Same as synset_cnt above.  This  is  redundant,  but  the
                      field was preserved for compatibility reasons.

       tagsense_cnt   Number  of  senses of lemma that are ranked according to
                      their frequency of occurrence  in  semantic  concordance
                      texts.

       synset_offset  Byte  offset  in  data.pos  file  of a synset containing
                      lemma.  Each synset_offset in the list corresponds to  a
                      different  sense  of lemma in WordNet.  synset_offset is
                      an 8 digit, zero-filled decimal integer that can be used
                      with fseek(3) to read a synset from the data file.  When
                      passed to read_synset(3WN) along with the syntactic cat-
                      egory,  a data structure containing the parsed synset is
                      returned.
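       The field layout above can be sketched as a small Python parser.  This
       is an illustration only: the names IndexEntry and parse_index_line are
       hypothetical, and the sample line is made up to follow the documented
       layout rather than copied from a real index file.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IndexEntry:
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[int]

def parse_index_line(line: str) -> Optional[IndexEntry]:
    """Parse one index.pos line; return None for header lines."""
    if line.startswith("  "):
        return None  # copyright/license lines begin with two spaces
    fields = line.split()
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    ptr_symbols = fields[4:4 + p_cnt]      # empty when p_cnt is 0
    rest = fields[4 + p_cnt:]
    return IndexEntry(
        lemma=fields[0],
        pos=fields[1],
        synset_cnt=synset_cnt,
        ptr_symbols=ptr_symbols,
        sense_cnt=int(rest[0]),
        tagsense_cnt=int(rest[1]),
        synset_offsets=[int(off) for off in rest[2:2 + synset_cnt]],
    )

# Made-up line in the documented layout.
sample = "example n 2 3 @ ~ + 2 1 00001740 00002137"
entry = parse_index_line(sample)
print(entry.lemma, entry.ptr_symbols, entry.synset_offsets)
```

       The p_cnt field has to be read before the pointer symbols, since it is
       the only indication of where the ptr_symbol list ends and sense_cnt
       begins.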

   Data File Format
       Each data file begins with several lines containing a copyright notice,
       version  number  and license agreement.  These lines all begin with two
       spaces and the line number.  All other lines are in the following  for-
       mat.  Integer fields are of fixed length, and are zero-filled.

       synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...]  [frames...]  |  gloss

       synset_offset  Current  byte  offset  in  the  file represented as an 8
                      digit decimal integer.

       lex_filenum    Two digit decimal integer corresponding to the
                      lexicographer file name containing the synset.  See
                      lexnames(5WN) for the list of filenames and their
                      corresponding numbers.

       ss_type        One character code indicating the synset type:

                      n    NOUN
                      v    VERB
                      a    ADJECTIVE
                      s    ADJECTIVE SATELLITE
                      r    ADVERB

       w_cnt          Two  digit  hexadecimal integer indicating the number of
                      words in the synset.

       word           ASCII form of a word as entered in  the  synset  by  the
                      lexicographer,  with spaces replaced by underscore char-
                      acters (_).  The text of the word is case sensitive,  in
                      contrast  to  its  form  in  the corresponding index.pos
                      file, which contains only lower-case forms.  In data.adj,
                      a  word  is  followed  by  a syntactic marker if one was
                      specified in the lexicographer file.  A syntactic marker
                      is  appended,  in parentheses, onto word without any in-
                      tervening spaces.  See wninput(5WN) for a  list  of  the
                      syntactic markers for adjectives.

       lex_id         One  digit  hexadecimal integer that, when appended onto
                      lemma, uniquely identifies a sense within  a  lexicogra-
                      pher file.  lex_id numbers usually start with 0, and are
                      incremented as additional senses of the word  are  added
                      to  the same file, although there is no requirement that
                      the numbers be consecutive or begin with 0.  Note that a
                      value  of 0 is the default, and therefore is not present
                      in lexicographer files.

       p_cnt          Three digit decimal integer  indicating  the  number  of
                      pointers from this synset to other synsets.  If p_cnt is
                      000 the synset has no pointers.

       ptr            A pointer from this synset to another.  ptr  is  of  the
                      form:

                      pointer_symbol  synset_offset  pos  source/target

                      where  synset_offset  is  the  byte offset of the target
                      synset in the data file corresponding to pos.

                      The source/target field distinguishes lexical and seman-
                      tic  pointers.   It is a four byte field, containing two
                      two-digit hexadecimal integers.  The  first  two  digits
                      indicate   the  word  number  in  the  current  (source)
                      synset, the last two digits indicate the word number  in
                      the   target   synset.   A  value  of  0000  means  that
                      pointer_symbol represents a  semantic  relation  between
                      the  current (source) synset and the target synset indi-
                      cated by synset_offset.

                      A  lexical  relation  between  two  words  in  different
                      synsets  is represented by non-zero values in the source
                      and target word numbers.  The first and last  two  bytes
                      of  this  field  indicate the word numbers in the source
                      and target synsets, respectively, between which the  re-
                      lation  holds.   Word  numbers  are assigned to the word
                      fields in a synset, from left to right,  beginning  with
                      1.

                      See  wninput(5WN) for a list of pointer_symbols, and se-
                      mantic and lexical pointer classifications.

       frames         In data.verb only, a list of  numbers  corresponding  to
                      the  generic  verb  sentence  frames  for  words  in the
                      synset.  frames is of the form:

                      f_cnt   +   f_num  w_num  [ +   f_num  w_num...]

                      where f_cnt is a two digit decimal integer indicating the
                      number  of  generic  frames listed, f_num is a two digit
                      decimal integer frame number, and w_num is a  two  digit
                      hexadecimal  integer  indicating  the word in the synset
                      that the frame applies to.  As with  pointers,  if  this
                      number  is 00, f_num applies to all words in the synset.
                      If non-zero, it is applicable only  to  the  word  indi-
                      cated.   Word  numbers  are  assigned  as  described for
                      pointers.  Each f_num  w_num pair is preceded  by  a  +.
                      See  wninput(5WN)  for  the text of the generic sentence
                      frames.

       gloss          Each synset contains a gloss.  A gloss is represented as
                      a  vertical bar (|), followed by a text string that con-
                      tinues until the end of the line.  The gloss may contain
                      a definition, one or more example sentences, or both.
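       The data line layout, including the mixed decimal and hexadecimal
       fields, can be sketched in Python as follows.  The function name
       parse_data_line and the dictionary keys are hypothetical, and the
       sample line is made up to follow the documented layout; it is not taken
       from a real data file.

```python
def parse_data_line(line):
    """Parse one data.pos synset line into a dictionary."""
    head, _, gloss = line.partition("|")   # gloss runs to end of line
    f = head.split()
    w_cnt = int(f[3], 16)                  # two digit hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], int(f[i + 1], 16)))   # (word, lex_id)
        i += 2
    p_cnt = int(f[i])                      # three digit decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, offset, pos, src_tgt = f[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "offset": int(offset),
            "pos": pos,
            "source": int(src_tgt[:2], 16),  # 0 means whole synset
            "target": int(src_tgt[2:], 16),
        })
        i += 4
    frames = []
    if i < len(f):                         # frames occur in data.verb only
        f_cnt = int(f[i])
        i += 1
        for _ in range(f_cnt):
            # f[i] is the literal "+" that precedes each pair
            frames.append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3
    return {
        "synset_offset": int(f[0]),
        "lex_filenum": int(f[1]),
        "ss_type": f[2],
        "words": words,
        "pointers": pointers,
        "frames": frames,
        "gloss": gloss.strip(),
    }

# Made-up verb synset line in the documented layout.
sample = ("00001740 29 v 02 alpha 0 beta 1 002 @ 00002137 v 0000 "
          "+ 00831191 n 0103 02 + 02 00 + 08 01 | a made-up gloss")
synset = parse_data_line(sample)
print(synset["words"], synset["frames"])
```

       In this sample the second pointer has source/target 0103, so it is a
       lexical pointer from word 1 of this synset to word 3 of the target,
       while the first pointer (0000) is a semantic pointer between synsets.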

   Sense Numbers
       Senses  in  WordNet are generally ordered from most to least frequently
       used, with the most common sense numbered 1.  Frequency of use  is  de-
       termined by the number of times a sense is tagged in the various seman-
       tic concordance texts.  Senses that are not semantically tagged  follow
       the ordered senses.  The tagsense_cnt field for each entry in the
       index.pos files indicates how many of the senses in the list have been
       tagged.

       The  cntlist(5WN)  file  provided with the database lists the number of
       times each sense is tagged in the semantic concordances.  The data from
       cntlist  is  used by grind(1WN) to order the senses of each word.  When
       the index.pos files are generated, the  synset_offsets  are  output  in
       sense  number  order,  with sense 1 first in the list.  Senses with the
       same number of semantic tags are assigned unique but consecutive  sense
       numbers.  The WordNet OVERVIEW search displays all senses of the speci-
       fied word, in all syntactic categories,  and  indicates  which  of  the
       senses are represented in the semantically tagged texts.
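       The ordering rule can be illustrated with a short Python sketch that
       derives sense numbers from the position of each synset_offset in an
       index entry.  The function name number_senses is hypothetical.  The
       sketch assumes, based on the description above, that the tagged senses
       occupy the first tagsense_cnt positions of the list.

```python
def number_senses(synset_offsets, tagsense_cnt):
    """Return (sense_number, synset_offset, is_tagged) for each sense.

    Sense numbers are 1-based positions in the index entry's offset list.
    Assumption: the first tagsense_cnt senses are the tagged ones.
    """
    return [
        (number, offset, number <= tagsense_cnt)
        for number, offset in enumerate(synset_offsets, start=1)
    ]

# Made-up offsets: three senses, of which the first two are tagged.
senses = number_senses([1740, 2137, 5000], 2)
for number, offset, tagged in senses:
    print(number, offset, "tagged" if tagged else "untagged")
```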

   Exception List File Format
       Exception  lists are alphabetized lists of inflected forms of words and
       their base forms.  The first field of each line is an  inflected  form,
       followed  by  a  space  separated list of one or more base forms of the
       word.  There is one exception list file for each syntactic category.
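       A minimal Python sketch of reading an exception list into a lookup
       table follows.  The function name load_exceptions is hypothetical, and
       the sample lines are illustrative entries written in the documented
       layout rather than quoted from a particular pos.exc file.

```python
def load_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:          # skip blank or malformed lines
            table[fields[0]] = fields[1:]
    return table

# Illustrative lines: an inflected form followed by one or more base forms.
sample_lines = [
    "axes ax axis\n",
    "geese goose\n",
    "mice mouse\n",
]
exceptions = load_exceptions(sample_lines)
print(exceptions.get("axes"))
print(exceptions.get("tables"))   # regular forms are simply absent
```

       A form that is missing from the table is not an error; it only means
       that the regular rules of detachment should be tried instead.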

       Note that the noun and verb exception lists were  automatically  gener-
       ated  from  a  machine-readable dictionary, and contain many words that
       are not in WordNet.  Also, for many of the inflected forms, base  forms
       could be easily derived using the standard rules of detachment
       programmed into Morphy (see morphy(7WN)).  These anomalies are allowed to
       remain in the exception list files, as they do no harm.

   Verb Example Sentences
       For  some  verb  senses,  example sentences illustrating the use of the
       verb sense can be displayed.  Each line of the  file  sentidx.vrb  con-
       tains a sense_key followed by a space and a comma separated list of ex-
       ample sentence template numbers, in decimal.  The file sents.vrb  lists
       all  of the example sentence templates.  Each line begins with the tem-
       plate number followed by a space.  The rest of the line is the text  of
       a  template example sentence, with %s used as a placeholder in the text
       for the verb.   Both  files  are  sorted  alphabetically  so  that  the
       sense_key and template sentence number can be used as indices, via
       binsrch(3WN), into the appropriate file.

       When a request for FRAMES is made, the WordNet search  code  looks  for
       the sense in sentidx.vrb.  If found, the listed sentence template(s) are
       retrieved from sents.vrb, and the %s is replaced with the verb.  If the
       sense  is not found, the applicable generic sentence frame(s) listed in
       frames are displayed.
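       The lookup described above can be sketched in Python.  All function
       names here are hypothetical, the sense_key and template text are made
       up, and plain dictionaries stand in for the binary search that the
       WordNet library performs on the sorted files.

```python
def parse_sentidx_line(line):
    """Split a sentidx.vrb line into (sense_key, [template numbers])."""
    sense_key, _, numbers = line.rstrip("\n").partition(" ")
    return sense_key, [int(n) for n in numbers.split(",") if n]

def parse_sents_line(line):
    """Split a sents.vrb line into (template number, template text)."""
    number, _, template = line.rstrip("\n").partition(" ")
    return int(number), template

def example_sentences(sense_key, verb, sentidx, templates):
    """Return the filled-in example sentences, or [] if the sense is absent."""
    return [
        templates[n].replace("%s", verb)
        for n in sentidx.get(sense_key, [])
    ]

# Made-up file contents in the documented layout.
sentidx = dict([parse_sentidx_line("run%2:38:00:: 1,2\n")])
templates = dict([
    parse_sents_line("1 The children %s to the playground\n"),
    parse_sents_line("2 They %s every morning\n"),
])

found = example_sentences("run%2:38:00::", "run", sentidx, templates)
missing = example_sentences("walk%2:38:00::", "walk", sentidx, templates)
print(found)
print(missing)
```

       An empty result corresponds to the fallback case, in which the generic
       sentence frames from the synset's frames field would be shown instead.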

NOTES
       Information in the data.pos and index.pos files represents all  of  the
       word senses and synsets in the WordNet database.  The word, lex_id, and
       lex_filenum fields together uniquely identify each word sense in
       WordNet.  These can be encoded in a sense_key as described in
       senseidx(5WN).  Each synset in the database can be uniquely identified by
       combining  the synset_offset for the synset with a code for the syntac-
       tic category (since it is possible for synsets  in  different  data.pos
       files to have the same synset_offset).

       The WordNet system provides both command line and window-based browser
       interfaces to the database.  Both interfaces utilize a  common  library
       of search and morphology code.  The source code for the library and in-
       terfaces is included in the WordNet package.  See wnintro(3WN)  for  an
       overview of the WordNet source code.

ENVIRONMENT VARIABLES (UNIX)
       WNHOME              Base directory for WordNet.  Default is
                           /usr/local/WordNet-3.0.

       WNSEARCHDIR         Directory in which the WordNet  database  has  been
                           installed.  Default is WNHOME/dict.
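       The way the two variables and their defaults combine can be sketched in
       Python.  The function name wordnet_search_dir is hypothetical; the
       sketch mirrors only the defaults stated above and is not the lookup
       code used by the WordNet library.

```python
import os
import posixpath

DEFAULT_WNHOME = "/usr/local/WordNet-3.0"

def wordnet_search_dir(environ=None):
    """Return the database directory implied by WNSEARCHDIR and WNHOME."""
    env = os.environ if environ is None else environ
    wnhome = env.get("WNHOME", DEFAULT_WNHOME)
    # WNSEARCHDIR, when set, takes precedence over WNHOME/dict.
    return env.get("WNSEARCHDIR", posixpath.join(wnhome, "dict"))

# Explicit mappings are passed here so the result does not depend on
# the environment of the machine running the example.
print(wordnet_search_dir({}))
print(wordnet_search_dir({"WNHOME": "/opt/wn"}))
print(wordnet_search_dir({"WNHOME": "/opt/wn", "WNSEARCHDIR": "/data/dict"}))
```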

REGISTRY (WINDOWS)
       HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
                           Base directory for WordNet.  Default is
                           C:\Program Files\WordNet\3.0.

FILES
       index.pos           database index files

       data.pos            database data files

       *.vrb               files of sentences illustrating the use of verbs

       pos.exc             morphology exception lists

SEE ALSO
       grind(1WN), wn(1WN), wnb(1WN), wnintro(3WN), binsrch(3WN),
       wnintro(5WN), cntlist(5WN), lexnames(5WN), senseidx(5WN), wninput(5WN),
       morphy(7WN), wngloss(7WN), wngroups(7WN), wnstats(7WN).

WordNet 3.0                        Dec 2006                          WNDB(5WN)