NAME
    swish-e - web indexing and retrieval system

SYNOPSIS
    swish-e -w *words* [-m *maxresults*] [-t *tags*] [-d
    *delimiter*] [-p *properties*] [-f *file* *file* ...] [-c
    *config*] [-M] [-l] [-D] [-V] [ -s *sortprop1* *sortprop2* ]
    [-b *startresults* ]

DESCRIPTION
    swish-e is the Simple Web Indexing System for Humans - Enhanced.
    swish-e searches for words and/or phrases in a previously created
    index of HTML pages or files. It returns a ranked list of the
    file names whose contents match the words.

    Please note that this documentation is not complete. The
    definitive source can only be found on the web:
    http://sunsite.berkeley.edu/SWISH-E/.

OPTIONS
    -w *word word ...*
        This performs a case-insensitive search using a number of
        keywords. If no index file to search is specified, swish-e
        will try to search a file called index.swish-e in the
        current directory. See below for the syntax and semantics of
        keywords.
        This may also be used to search for a phrase. The phrase
        delimiter is ".
        e.g.: -w test   (searches for the word test)
        e.g.: -w 'test or "my phrase"' (searches for the word test or
             the phrase "my phrase"). The ' quoting avoids problems
             with your UNIX shell.
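
        When a script runs the search, passing the whole query as one
        argument sidesteps the shell quoting issue mentioned above.
        The Python sketch below only builds the argument list; the
        helper name build_search_argv is hypothetical, and it uses
        nothing beyond the -w, -f and -m options described on this
        page.

```python
def build_search_argv(query, index_files=(), max_results=None):
    """Build an argument list for a swish-e search (hypothetical helper)."""
    argv = ["swish-e", "-w", query]
    if index_files:
        # -f accepts one or more index files when searching.
        argv += ["-f", *index_files]
    if max_results is not None:
        argv += ["-m", str(max_results)]
    return argv


# The phrase quotes reach swish-e untouched when the list is handed to
# something like subprocess.run(argv), because no shell is involved.
argv = build_search_argv('test or "my phrase"', ["myIndex"], 10)
print(argv)
```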

    -t HBthec
        The -t option allows you to search for words that exist only
        in specific HTML tags. Each character in the string you
        specify in the argument to this option represents a
        different tag to search for the word in. H means all HEAD
        tags, B stands for BODY tags, t is all TITLE tags, h is H1
        to H6 (header) tags, e is emphasized tags (this may be B, I,
        EM, or STRONG), and c is HTML comment tags.

    -f *indexfile1 indexfile2 ...*
        If you are indexing, this specifies the file to save the
        generated index in, and you can only specify one file. If
        you are searching, this specifies the index files (one or
        more) to search. The default index file is index.swish-e
        in the current directory.

    -p *property1 property2 ...*
        NOTE: it is necessary to have indexed with the proper
        PropertyNames directive in the user config file in order to
        use this option.

    -d *character*
        The delimiter string can be anything you like, although the
        special string "dq" will be interpreted to mean a single
        double quote character. To parse a line of the output in
        Perl (using the "dq" option) use:

                ($rank, $filename, $title, $filesize) = split(/\"/, $_);
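
        For comparison, the same split can be sketched in Python. The
        field order follows the Perl line above; the sample line is
        made up for illustration, and treating rank and file size as
        integers is an assumption.

```python
def parse_result_line(line, delimiter='"'):
    """Split one result line into rank, file name, title and file size."""
    rank, filename, title, filesize = line.rstrip("\n").split(delimiter)
    # Assumption: rank and file size are printed as plain integers.
    return int(rank), filename, title, int(filesize)


# Hypothetical output line as produced with -d dq; real values will differ.
sample = '1000"/web/index.html"Home Page"2048\n'
print(parse_result_line(sample))
```

        The function raises ValueError on any line that does not
        contain exactly four fields, so non-result lines should be
        filtered out first.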


    -m *number*
        While searching, this specifies the maximum number of
        results to return. The default is 40. If no numerical value
        is given, the default is assumed. If the value is 0 or the
        string all, there will be no limit to the number of results.
        The configuration file value overrides this value.

    -c *config-file*
        Start indexing using the parameters from *config-file*.
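
        As a rough illustration, a small configuration can be
        generated from a script and then passed to -c. The Python
        sketch below assumes the "Directive value value ..." line
        layout used by the examples in the CONFIG FILE section; the
        helper name render_config and the file name example.conf are
        hypothetical.

```python
def render_config(directives):
    """Render (name, values) pairs as 'Name value value ...' lines."""
    return "".join(
        f"{name} {' '.join(values)}\n" for name, values in directives
    )


config_text = render_config([
    ("IndexDir", ["/usr/local/www"]),
    ("IndexFile", ["index.swish-e"]),
    ("IndexOnly", [".html", ".txt"]),
    ("IndexReport", ["3"]),
])
print(config_text, end="")

# After saving config_text as example.conf, indexing would be started with:
index_argv = ["swish-e", "-c", "example.conf"]
```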

    -S [ fs | http ]
        Specify which indexing system to use: fs to index the file
        system, http to index web sites using a web crawler.

    -l  Follow symbolic links when indexing.

    -M *file file ...*
        Merge indexing files.

    -D *index*
        Decode an index file.

    -V  Print the current version.

    -b *startresults*
        Print the results starting from the one indicated by the
        numeric value *startresults*.
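
        Combined with -m, this option can be used to page through
        results. The Python sketch below computes both options for a
        page number; it assumes results are numbered from 1, which
        this page does not state, and page_options is a hypothetical
        helper name.

```python
def page_options(page, page_size):
    """Return -m/-b options for a 1-based page number.

    Assumption: the first result is number 1.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size + 1
    return ["-m", str(page_size), "-b", str(start)]


# Third page of ten results each.
print(page_options(3, 10))
```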

    -s *sortprop1* *sortprop2* ...
        Sorts results by the properties *sortprop1* *sortprop2* ...
        instead of by rank. Only descending sort is supported. The
        sort is case insensitive.
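
        The ordering described here (several properties, descending,
        case insensitive) can be modelled in a few lines of Python.
        This is only a sketch of the documented behavior, not
        swish-e's own code; the result records and property names are
        invented.

```python
def sort_by_properties(results, sort_props):
    """Order results by the given properties, descending, ignoring case."""
    def key(result):
        return tuple(str(result[prop]).lower() for prop in sort_props)
    return sorted(results, key=key, reverse=True)


# Hypothetical results carrying "author" and "title" properties.
results = [
    {"file": "a.html", "author": "smith", "title": "Zebra"},
    {"file": "b.html", "author": "Jones", "title": "apple"},
    {"file": "c.html", "author": "SMITH", "title": "mango"},
]
ordered = sort_by_properties(results, ["author", "title"])
print([r["file"] for r in ordered])
```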


SEARCHING
  Boolean Operators

    You can use the boolean operators and, or, and not in searching.
    Without these operators, swish-e will assume you're anding the
    words together. The operators are case sensitive -- use
    lowercase ONLY.

    Evaluation takes place from left to right only, although you can
    use parentheses to force the order of evaluation.

         % swish-e -w 'smilla or snow' -f myIndex


    retrieves files containing either the word "smilla" or the word
    "snow".

         % swish-e -w 'smilla and snow not sense' -f myIndex


    first retrieves the files that contain both the words "smilla"
    and "snow", then keeps only those that do not contain the
    word "sense".

  Truncation

    The only wildcard available at this time is the asterisk (*),
    and it can only be used at the end of a word. Using it at the
    beginning or in the middle of a word will yield no results.

         % swish-e -w 'librarian' -f myIndex

    This query retrieves only files which contain the given word.

    On the other hand:

         % swish-e -w 'librarian*' -f myIndex

    retrieves "librarians", "librarianship", etc. along with
    "librarian".

    The wildcard can also be used in phrase searching:

         % swish-e -w '"this is a samp* phra*"' -f myIndex
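
    The truncation rule can be summarized as a prefix match. The
    Python sketch below models only the behavior described in this
    section (trailing wildcard, case-insensitive comparison); it is
    not swish-e's implementation, and term_matches is a hypothetical
    name.

```python
def term_matches(term, word):
    """Return True if an indexed word satisfies a search term."""
    term, word = term.lower(), word.lower()
    if "*" not in term:
        return term == word
    if term.count("*") == 1 and term.endswith("*") and len(term) > 1:
        # Trailing wildcard: compare only the part before the asterisk.
        return word.startswith(term[:-1])
    # A wildcard at the beginning or in the middle matches nothing.
    return False


for candidate in ("librarian", "librarians", "librarianship", "library"):
    print(candidate, term_matches("librarian*", candidate))
```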



  Meta Tags

    The equal sign indicates the presence of a metaName, and the
    search returns all the files where the META tag with
    NAME="metaName" has CONTENT="word" (or where "word" is contained
    in the area marked by the <!--META START... --> and <!--META
    END... --> tags).

    It is not necessary to have spaces at either side of the '=',
    consequently the following are equivalent:

         % swish-e -w 'metaName = word' -f

         % swish-e -w 'metaName=word' -f

         % swish-e -w 'metaName= word' -f


    To search for a word that contains a '=', precede the '=' with
    a '/':

         % swish-e -w 'test/=3 = x/=4 or y/=5' -f <index.file>


    This query returns the files where the word "x=4" is associated
    with the metaName "test=3" or that contains the word "y=5" not
    associated with any metaName.
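
    The spacing and escaping rules above can be illustrated with a
    small Python sketch that splits a single term into its metaName
    and word. It models only what this section describes; the
    function name split_meta_term is hypothetical and the code is not
    taken from swish-e.

```python
import re


def split_meta_term(term):
    """Split 'metaName = word' on the first '=' not preceded by '/'."""
    parts = re.split(r"(?<!/)=", term, maxsplit=1)
    # Turn each escaped '/=' back into a literal '='.
    parts = [part.strip().replace("/=", "=") for part in parts]
    if len(parts) == 1:
        return None, parts[0]  # plain word, no metaName
    return parts[0], parts[1]


for term in ("metaName = word", "metaName=word", "test/=3 = x/=4", "y/=5"):
    print(split_meta_term(term))
```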

    Queries can also be constructed using any of the usual search
    features; moreover, metaName and plain searches can be mixed in
    a single query.

         % swish-e -w 'metaName1 = (a1 or a4) not (a3 and a7)' -f yyy


    This query will retrieve all the files in which the "metaName1"
    is associated either with "a1" or "a4" and that do not contain
    the words "a3" and "a7", where "a3" and "a7" are not associated
    with any meta name.


  Order of Evaluation

    Expressions are always evaluated left to right:

         % swish-e -w 'juliet not ophelia and pac' -f myIndex


    retrieves files which contain "juliet" and "pac" but not
    "ophelia". However, it is always possible to force the order of
    evaluation by using parentheses. For example:

         % swish-e -w 'juliet not (ophelia and pac)' -f myIndex


    retrieves files that contain "juliet" but do not contain both
    "ophelia" and "pac".
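
    To make the left-to-right rule concrete, here is a minimal Python
    model of the evaluation using plain set operations. It is a sketch
    of the semantics described in this section, not swish-e's parser:
    the function names and the four sample files are invented, and
    adjacent words without an operator are treated as an implicit
    "and".

```python
import re


def search(query, docs):
    """Evaluate a query strictly left to right.

    docs maps a file name to the set of words that file contains.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    everything = set(docs)
    pos = 0

    def operand():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "not":
            return everything - operand()
        if tok == "(":
            result = expression()
            pos += 1  # skip the closing ")"
            return result
        return {name for name, words in docs.items() if tok in words}

    def expression():
        nonlocal pos
        result = operand()
        while pos < len(tokens) and tokens[pos] != ")":
            op = "and"  # implicit operator between adjacent operands
            if tokens[pos] in ("and", "or"):
                op = tokens[pos]
                pos += 1
            right = operand()
            result = result | right if op == "or" else result & right
        return result

    return expression()


# Hypothetical index contents.
docs = {
    "a": {"juliet", "pac"},
    "b": {"juliet", "ophelia", "pac"},
    "c": {"juliet", "ophelia"},
    "d": {"pac"},
}
print(sorted(search("juliet not ophelia and pac", docs)))
print(sorted(search("juliet not (ophelia and pac)", docs)))
```

    With these sample files the first query selects only "a", while
    the parenthesized query also keeps "c", which contains "ophelia"
    but not "pac".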

  Context

    At times you might not want to search for a word in every part
    of your files since you know that the word(s) are present in a
    particular tag. The ability to search according to context
    greatly increases the chances that your hits will be relevant,
    and swish-e provides a mechanism to do just that.

    The -t option in the search command line allows you to search
    for words that exist only in specific HTML tags. Each character
    in the string you specify in the argument to this option
    represents a different tag in which the word is searched; that
    is, you can use any combination of the following characters:

    H   means all HEAD tags

    B   stands for BODY tags

    t   is all TITLE tags

    h   is H1 to H6 (header) tags

    e   is emphasized tags (this may be B, I, EM, or STRONG)

    c   is HTML comment tags (<!-- ... -->)


  Examples

            swish-e -w 'apples oranges' -t t -f myIndex


    This search will look for files with these two words in their
    titles only.

            swish-e -w 'keywords draft release' -t c -f myIndex


    This search will look for files with these words in comments
    only.

            swish-e -w 'world wide web' -t the -f myIndex


    This search will look for words in titles, headers, and
    emphasized tags.
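
    A wrapper script may want to check a -t argument before running
    the search. The Python sketch below maps each character to the
    tag group listed in this section and rejects anything else; the
    names CONTEXT_TAGS and describe_context are hypothetical.

```python
CONTEXT_TAGS = {
    "H": "HEAD tags",
    "B": "BODY tags",
    "t": "TITLE tags",
    "h": "H1 to H6 (header) tags",
    "e": "emphasized tags (B, I, EM, or STRONG)",
    "c": "HTML comment tags",
}


def describe_context(flags):
    """Return the tag groups selected by a -t argument, in order."""
    unknown = [flag for flag in flags if flag not in CONTEXT_TAGS]
    if unknown:
        raise ValueError(f"unknown -t characters: {''.join(unknown)}")
    return [CONTEXT_TAGS[flag] for flag in flags]


print(describe_context("the"))
```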

CONFIG FILE
    Some basic variables in the user configuration file. If not
    otherwise specified, all directives are used by both the
    FILESYSTEM and HTTP methods.

    IndexDir *directory*
        The IndexDir variable tells swish-e what directories and
        files to index. Each specified directory will be indexed
        recursively. You can use more than one of these directives -
        here are some examples:

                IndexDir /usr/local/www /src/code.html
                IndexDir /users/tony/public_html/home.html /web


        For the HTTP method, specify the URLs from which
        spidering should start.

    IndexFile *indexfile*
        The IndexFile variable tells swish-e what to save the indexed
        results as. Indexes generated by swish-e should have a
        suffix of .swish-e.

    IndexOnly *.suffix1 .suffix2 .suffix3 ...*
        Only files with these suffixes will be indexed. If you omit
        this variable, swish-e will index every file it comes
        across. Suffix checking is not case sensitive. This
        directive is only available for the FILESYSTEM method.

    PropertyNames *author*
        List of names that can be retrieved with the -p option.
        Index size increases according to the formula in the manual.
        Comment this directive out if you have no PropertyNames.

    UseStemming no
        Set this directive to yes if you would like stemming.

    IndexReport *3*
        This variable can have the values 0 to 3. If you specify 3,
        swish-e will tell you what's going on while it's indexing,
        printing out directory and file names, number of words
        indexed, and so on, as well as give information about other
        operations. The value 0 will make swish-e completely silent.

    FollowSymLinks yes|no
        Normally swish-e ignores symbolic links to files when
        indexing. If you want it to follow such links, define this
        value as yes, else define it as no.

    NoContents *.suffix1 .suffix2 .suffix3 ...*
        This variable lets you control what files will have their
        contents indexed. If a file with a suffix in this list is
        indexed, only its file name (and not any words in the file)
        will be indexed. This is useful because normally swish-e
        will try to index the contents of every file, even files
        without words (such as images or movies). Suffix checking is
        case-insensitive.

    IgnoreWords *word1 word2 ...*
        Here you can specify words to ignore when searching. Usually
        these words (called stopwords) are words that occur too many
        times in your data to make indexing them worthwhile. If you
        specify a word as SwishDefault, it will be replaced with
        swish-e's default list - a few hundred very common English
        words.
        It is also possible to specify an external stopwords file using:
        IgnoreWords File:/path/file

    IgnoreLimit *number1 number2*
        After indexing, swish-e can automatically tell which words
        are the most common and omit them from the index according
        to these parameters. Here are some examples:

                IgnoreLimit 80 256


        Swish will ignore all words that occur in over 80% of the
        files and that also occur in over 256 different files.

                IgnoreLimit 50 50


        Swish will ignore all words that occur in over 50% of the
        files and that also occur in over 50 different files.

        Using IgnoreLimit and IgnoreWords can help trim the size of
        your index files considerably - experiment with parameters
        to see what works best at your site. You can also use
        IgnoreLimit to limit the CPU resources that searches take.
        This option is no longer used by default.
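
        The two thresholds combine as a logical "and": a word is
        dropped only when it exceeds both. The Python sketch below
        restates that rule for a single word, reading the first
        number as the percentage and the second as the file count;
        the function name is_ignored is hypothetical and the code is
        not taken from swish-e.

```python
def is_ignored(files_with_word, total_files, percent_limit, count_limit):
    """Apply the two IgnoreLimit thresholds to one word."""
    share = 100.0 * files_with_word / total_files
    return share > percent_limit and files_with_word > count_limit


# "IgnoreLimit 80 256" on a 350-file index: 300 files is about 85.7%
# of the files and more than 256 files, so the word is ignored.
print(is_ignored(300, 350, 80, 256))
# The same word in a 1000-file index is only in 30% of the files.
print(is_ignored(300, 1000, 80, 256))
```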

    IndexName *value*
    IndexDescription *value*
    IndexPointer *value*
    IndexAdmin *value*
        These variables specify information that goes into index
        files to help users and administrators. IndexName should be
        the name of your index, like a book title. IndexDescription
        is a short description of the index or a URL pointing to a
        more full description. IndexPointer should be a pointer to
        the original information, most likely a URL. IndexAdmin
        should be the name of the index maintainer and can include
        name and email information. These values should not be more
        than 70 or so characters and should be contained in quotes.
        Note that the automatically generated date in index files is
        in D/M/Y and 24-hour format.

    MetaNames *name1 name2*
        These variables specify the meta names used in the .html
        files. Do not comment out or erase this line. MetaNames need
        to be one word with no quotes.
        There is a reserved word here: automatic. If it is specified,
        the index engine will try to extract all the metanames
        dynamically.

    WordCharacters
          abcdefghijklmnopqrstuvwxyz&#;0123456789.@|,-'[](~!@$%^{}_+?
        Wordchars is a string of characters which swish-e permits to
        be in words. Any strings which do not include these
        characters will not be indexed. You can choose from any
        character in the following string:

               abcdefghijklmnopqrstuvwxyz0123456789_|/-+=?!@$%^'`~,.[]{}()


        Note that if you omit 0123456789&#; you will not be able to
        index HTML entities. DO NOT use the asterisk (*), less-than
        and greater-than signs, or colon (:). Including any of
        these four characters may cause funny things to happen.
        NOTE: Do not escape nor and they cannot be the first letter
        in the string. Commenting out the line will give the
        defaults. If not set, it defaults to the value in config.h.

    BeginCharacters *string*
        Of the characters that you decide can go into words, this is
        a list of characters that words can begin with. It should be
        a subset of (or equal to) WordCharacters. The same rules of
        syntax apply as for WordCharacters. If not set, it defaults
        to the value in config.h.

    EndCharacters *string*
        Of the characters that you decide can go into words, this is
        a list of characters that words can end with. It should be
        a subset of (or equal to) WordCharacters. The same rules of
        syntax apply as for WordCharacters. If not set, it defaults
        to the value in config.h.

    IgnoreLastChar *string*
        Array that contains the chars that, if considered valid in
        the middle of a word, need to be disregarded when at the
        end. It is important to also set the given chars in the
        ENDCHARS array, otherwise the word will not be indexed
        because it is considered invalid. Commenting out the line
        will give the defaults. NOTE: if a double quote is the
        first char in the string it needs to be escaped with \ .
        Do not escape otherwise.

    IgnoreFirstChar *string*
        Array that contains the chars that, if considered valid in
        the middle of a word, need to be disregarded when at the
        beginning. This was to solve the problem of parentheses when
        there is no space between ( and the beginning of the word.
        Remember to add the chars to the BEGINCHARS list also.
        Commenting out the line will give the defaults. NOTE: if a
        double quote is the first char in the string it needs to be
        escaped with \ . Do not escape otherwise.
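
        The interplay of WordCharacters, BeginCharacters,
        EndCharacters and the two Ignore directives can be pictured
        with the Python sketch below. It is a simplified model under
        stated assumptions, not swish-e's code: words are lowercased
        first, validation happens before the ignored characters are
        stripped (which is why those characters must also appear in
        the begin and end lists), and normalize_word is a
        hypothetical name.

```python
def normalize_word(word, word_chars, begin_chars, end_chars,
                   ignore_first="", ignore_last=""):
    """Return the word as it would be indexed, or None if it is rejected."""
    word = word.lower()  # assumption: indexing is case-insensitive
    if not word or any(ch not in word_chars for ch in word):
        return None
    if word[0] not in begin_chars or word[-1] not in end_chars:
        return None
    # Validation comes first, so ignored chars must also be allowed above.
    word = word.lstrip(ignore_first).rstrip(ignore_last)
    return word or None


letters = "abcdefghijklmnopqrstuvwxyz"
word_chars = letters + "()."
begin_chars = letters + "("
end_chars = letters + ")."

print(normalize_word("(Hello", word_chars, begin_chars, end_chars, "(", ")."))
print(normalize_word("end.", word_chars, begin_chars, end_chars, "(", ")."))
# Without "." in the end characters the same word is rejected.
print(normalize_word("end.", word_chars, begin_chars, letters, "(", ")."))
```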

    MaxDepth *number*
        (default 5) This defines how many links the spider should
        follow before stopping. A value of 0 configures the spider
        to traverse all links. This directive is only available for
        the HTTP method.

    Delay *seconds*
        The number of seconds (default 60) to wait between issuing
        requests to a server. This directive is only available for
        the HTTP method.

    TmpDir *dir*
        The location (default /var/tmp) of a writeable temp
        directory on your system. The HTTP access method tells the
        Perl helper to place its files there. This directive is only
        available for the HTTP method.

    SpiderDirectory *dir*
        The location (default ./) of the Perl helper script.
        Remember, if you use a relative directory, it is relative to
        your directory when you run swish-e, not to the directory
        that swish-e is in. This directive is only available for the
        HTTP method.

    EquivalentServer *hostname hostname ...*
        (default nothing) This allows you to deal with servers that
        respond to multiple DNS names. Each line should have a
        list of all the method/names that should be considered
        equivalent. If you have multiple directives, each one
        defines its own set of equivalent servers. This directive is
        only available for the HTTP method.

    TranslateCharacters *string1* *string2*
        This is useful to translate some characters in
        the words. It takes two strings: the original characters and
        the translated characters.
        Example:

        TranslateCharacters -! /?

        This makes the word "1-2" indexed as "1/2" and "9!1" as "9?1".
        Remember that all the chars in these strings must also be in
        WordCharacters.
        This option is useful for non-English languages, especially if
        you want accented vowels to be indexed without the accent.
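
        The mapping is positional: the first character of the first
        string becomes the first character of the second string, and
        so on. The Python sketch below mimics this with
        str.maketrans; requiring both strings to have the same length
        is an assumption, and translate_word is a hypothetical name.

```python
def translate_word(word, original, translated):
    """Replace each char of `original` with its counterpart in `translated`."""
    if len(original) != len(translated):
        # Assumption: the two strings pair up one character at a time.
        raise ValueError("both strings must have the same length")
    return word.translate(str.maketrans(original, translated))


# Equivalent of "TranslateCharacters -! /?" applied to two words.
print(translate_word("1-2", "-!", "/?"))
print(translate_word("9!1", "-!", "/?"))
```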

XML Support
    New in version 2.0. Only partial, see README-2.0

FILTER OPTION
    From a patch to 1.3.2, it is included in 2.0. See README-FILTERS

IMPORTANT NOTE
    See README-2.0 and README-FILTERS included in the package to see
    all addons to version 2.0

EXAMPLE CONFIG FILE
    See conf directory in the distribution

AUTHOR
    SWISH was created by Kevin Hughes. In Fall 1996, The Library of
    UC Berkeley received permission from Kevin Hughes to implement
    bug fixes and enhancements to the original binary. The result is
    SWISH-Enhanced or swish-e, brought to you by the swish-e
    Development Team.

SEE ALSO
    The definitive (and more complete) documentation can be found
    here:

        http://sunsite.berkeley.edu/SWISH-E/

