About regexp of SCAN METHOD

Last update: 1999.03.01

Regular expressions in the SCAN method

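Unlike the SEARCH method, the SCAN method compares the search key directly against the character string stored in each record and retrieves the records whose data match the key. The search key is the "value" specified in the Object-Body of the SCAN method (see "CATP/1.0 Specification, 2nd edition, 4.6 SCAN Method, 4.6.1.4 Object-Body").
Regular expressions can be used when specifying the search key.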

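Regular expression syntax

In regular expressions, the following characters are treated as having a special meaning (each is listed with its character name and hexadecimal code):

"\" (yen sign = YEN SIGN: 5C; the same code as the ASCII backslash),
"^" (circumflex = CIRCUMFLEX ACCENT: 5E),
"$" (dollar sign = DOLLAR SIGN: 24),
"|" (vertical line = VERTICAL LINE: 7C),
"." (period = FULL STOP: 2E),
"+" (plus sign = PLUS SIGN: 2B),
"*" (asterisk = ASTERISK: 2A),
"?" (question mark = QUESTION MARK: 3F),
"[" (left square bracket = LEFT SQUARE BRACKET: 5B),
"]" (right square bracket = RIGHT SQUARE BRACKET: 5D),
"-" (hyphen, minus sign = HYPHEN-MINUS: 2D),
"(" (left parenthesis = LEFT PARENTHESIS: 28),
")" (right parenthesis = RIGHT PARENTHESIS: 29).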

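In addition, the character

'"' (quotation mark = QUOTATION MARK: 22)

is not a regular expression metacharacter, but it has a special meaning in the SCAN method because the search key is enclosed in '"' when it is passed to the SCAN method.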

The internal regular expression routine uses the syntax rules below to decide whether the search key matches the data in the target field, and the records that match are retrieved.

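A regular expression consists of zero or more branches, separated by "|". The regular expression as a whole matches if any one of its branches matches.

A branch is a sequence of zero or more pieces.
A branch matches when the first piece matches, then the second piece matches, and so on for every piece in that order.

A piece is an atom, optionally followed by "*", "+", or "?".

Followed by "*", the piece matches a sequence of zero or more occurrences of the atom.
Followed by "+", the piece matches a sequence of one or more occurrences of the atom.
Followed by "?", the piece matches either the null string or the atom.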
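The branch and piece rules above can be tried out with a short sketch. It uses Python's `re` module purely as a stand-in, on the assumption that it behaves like the SCAN routine for these basic constructs; the helper name `matches` and the sample strings are made up for illustration.

```python
import re


def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# Three branches separated by "|": the expression matches if any branch does.
print(matches("cat|dog|bird", "hot dog"))    # True
print(matches("cat|dog|bird", "goldfish"))   # False

# One atom ("b") with each of the three suffixes; "^" and "$" pin the
# pattern to the whole string so the difference is visible.
for text in ["ac", "abc", "abbc"]:
    print(text,
          matches("^ab*c$", text),   # zero or more "b"
          matches("^ab+c$", text),   # one or more "b"
          matches("^ab?c$", text))   # null string or a single "b"
# ac True False True
# abc True True True
# abbc True True False
```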

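An atom is one of the following:

a regular expression enclosed in "(" and ")",
a range (see below),
"." (matches any single character),
"^" (matches the beginning of the input string),
"$" (matches the end of the input string),
"\" followed by any single character (matches that character; used in particular when the character has a special meaning),
any single character with no special meaning (matches that character).

Note that "." matches one character, not one byte.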
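The difference between matching one character and matching one byte can be seen in the following sketch. Python's `re` is again only a stand-in: a `str` pattern works on characters, which is analogous to the wchar_t-based routine, while a `bytes` pattern works on bytes. The UTF-8 encoding is chosen for the demonstration and is not necessarily the encoding the server uses.

```python
import re

title = "日本古典文学史"
pattern = "日本..文学"

# On character strings, each "." consumes one character ("古" and "典").
by_character = re.search(pattern, title) is not None
print(by_character)   # True

# On UTF-8 bytes, each "." consumes a single byte, so two dots cannot span
# two three-byte characters and the same pattern no longer matches.
by_byte = re.search(pattern.encode("utf-8"), title.encode("utf-8")) is not None
print(by_byte)        # False

# "\" followed by a special character matches that character literally.
escaped = re.search(r"\(1921-\)", "松本, 博(1921-)") is not None
print(escaped)        # True
```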

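A range is a string of characters enclosed in "[" and "]".

A range normally matches any single character from the string.
If the string inside "[]" begins with "^", the range matches any single character that is not in the rest of the string.
When "-" appears between two characters in the string, it stands for the list of all characters whose internal codes (as wchar_t values) lie between the codes of the characters on either side of it.
For example, "[0-9]" matches any single digit.
To include the character "]" in a range, make it the first character after "[" (or after the "^" that follows it).
To include the character "-" itself in a range, place it first or last.

Note that inside "[]" the "\" does not act as an escape character.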
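The range rules can be checked with the sketch below. Python's `re` is used as a stand-in and agrees with the rules above for these cases, with one known difference: in Python a "\" inside "[]" is still an escape character, so that point is not demonstrated here. The helper `first_match` and the sample strings are illustrative, and the hiragana example relies on Unicode code point order, which is only analogous to the wchar_t ordering used by the actual routine.

```python
import re


def first_match(pattern, text):
    """Return the first substring matched by the pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match("[0-9]", "ISBN 4-12"))      # 4   (any single digit)
print(first_match("[^0-9]", "42nd"))          # n   (anything but a digit)
print(first_match("[]x]", "a]b"))             # ]   ("]" placed first is literal)
print(first_match("[^]x]", "]xa"))            # a   ("]" right after "^" is literal)
print(first_match("[-+]", "a-b"))             # -   ("-" placed first is literal)
print(first_match("[a-]", "q-"))              # -   ("-" placed last is literal)
print(first_match("[ぁ-ん]", "日本のうた"))    # の  (range over wide characters)
```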

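Specifying "^$" retrieves the records that do not have the specified field. If the specified field is a group field, the records retrieved are those in which the group field contains no occurrence of the specified field at all.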

Besides ordinary syntax errors, the internal routine raises an error when the operand of "+" or "*" could be the empty string, as in "^+".
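A client could catch this kind of mistake before sending a request. The sketch below uses Python's `re` as a stand-in, which also rejects "^+"; its rules for this error may not be identical to those of the SCAN routine, so the check is a rough pre-flight test rather than a guarantee. The function name `compiles` is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles("^a+"))   # True: "+" applies to the atom "a"
print(compiles("^+"))    # False: "+" applies to "^", which matches the empty string
print(compiles("(ab"))   # False: ordinary syntax error (unbalanced parenthesis)
```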

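The SCAN method does not normalize the search key. To have a search key matched against the target field data as if it had been normalized, the key must be written as a regular expression when it is passed to the SCAN method.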
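As a minimal sketch of this idea, assume that the normalization wanted is nothing more than ignoring letter case (the normalization rules themselves are not described here). A key can then be expanded into ranges such as "[Oo]", with special characters escaped by "\". The function name `case_insensitive_key` is hypothetical, and the result is the pattern as the regular expression routine should see it; any extra escaping required by the SCAN Object-Body is not applied.

```python
import re

# Characters listed as special in the regular expression syntax.
SPECIAL = set("\\^$|.+*?[]-()")


def case_insensitive_key(key):
    """Expand each letter into a two-character range and escape specials."""
    parts = []
    for ch in key:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            parts.append("[" + upper + lower + "]")
        elif ch in SPECIAL:
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)


pattern = case_insensitive_key("unesco")
print(pattern)   # [Uu][Nn][Ee][Ss][Cc][Oo]
print(re.search(pattern, "UNESCO; twenty years of service to peace") is not None)   # True
```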

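In the SCAN method, delimiters other than the special characters (symbols) listed above can be used in the search key without any special notation. The same applies to spaces.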

Input examples (only the Object-Body is shown)

Kind of search	Search key used to build the result set	Example SCAN Object-Body	Examples of records retrieved (○) and not retrieved (×)
Leading string	TITLEKEY="SPACE" TITLEKEY="TIME" AND	TRD="^A"	TRD=A Treatise on time and space / J. R. Lucas (○) TRD=Space-time structure / by Erwin Schrodinger (×)
Trailing string	TITLEKEY="GOOD"	TRD="ing$"	TRD=Good housekeeping (○) TRD=Critic and good literature (×)
Match on both sides (any single character)	SHKEY="UNESCO"	TRD="rgani.ation"	TRD=Twenty-fifth anniversary of the organization, 4-5 ... (○) TRD=Das ordentliche Haushalts- und Finanzwesen der Organisation ... (○) TRD=UNESCO; twenty years of service to peace, ... (×)
Match on both sides (several arbitrary characters, fixed count)	TITLEKEY="ニホン" TITLEKEY="ブンガク" AND	TRD="日本..文学"	TRD=日本古典文学史 / 乾安代〔ほか〕著 (○) TRD=日本近世文学論 / 森田喜郎著 (○) TRD=日本文学史 / ドナルド・キーン著 ; 徳岡孝夫訳 (×)
Match on both sides (several arbitrary characters, zero or more)	TITLEKEY="チュウセイ" TITLEKEY="ブンガク" AND	TRD="中世.*文学"	TRD=中世文学の研究 / 秋山虔編 (○) TRD=中世日本文学と時間意識 / 永藤靖著 (○) TRD=ヨーロッパ文学とラテン中世 / E.R. クルツィウス〔著〕 ; 南大路振一... (×)
Search including a delimiter (space)	TITLEKEY="ANTARCTIC*" TITLEKEY="METEORITES" AND	TRD="Antarctic Meteorites" (with a space between the words)	TRD=Proceedings of the fourth Symposium on Antarctic Meteorites ... (○) TRD=Yamato meteorites collected in Antarctica in 1969 ... (×)
Search including a delimiter (/)	AUTHKEY="US" AUTHKEY="USSR" AUTHKEY="SYMPOSIUM" AND	HDNGD="US/USSR"	HDNGD=US/USSR Environmental Economics Symposium (○) HDNGD=Joint US-USSR Symposium on Hypertension (×)
Search including a delimiter (&)	TITLEKEY="RIGHT" TITLEKEY="LEFT" AND	TRD="&"	TRD=Left, right & babyboom : America's new politics ... (○) TRD=Left brain, right brain / Sally P. Springer, Georg ... (×)
Search including a delimiter (parenthesis)	AUTHKEY="マツモト" AUTHKEY="ヒロシ" AND	HDNGD="\\(" (see note)	HDNGD=松本, 博(1921-) (○) HDNGD=松本, 博 (×)
Presence of data: bibliographic records with no ED field		ED="^$"

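(Note) To make a character that has a special meaning in the SCAN method (such as "(" or "$") itself the target of the regular expression, the "\" (yen sign = YEN SIGN: 5C) must be written twice. This is because a "\" has to be passed as an escape character both to the SCAN method and to the regular expression processing module.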
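The examples in the table can be replayed with a short sketch, using Python's `re` as a stand-in for the SCAN routine (an assumption that holds for these simple patterns). The patterns are written as the regular expression routine sees them, so the parenthesis example uses a single "\"; in the Object-Body it is written "\\(" as the note explains. Record texts are abbreviated from the table, and the names `EXAMPLES` and `check` are illustrative.

```python
import re

# (pattern, field data, expected result: True for ○, False for ×)
EXAMPLES = [
    ("^A", "A Treatise on time and space / J. R. Lucas", True),
    ("^A", "Space-time structure / by Erwin Schrodinger", False),
    ("ing$", "Good housekeeping", True),
    ("ing$", "Critic and good literature", False),
    ("rgani.ation", "Twenty-fifth anniversary of the organization, 4-5", True),
    ("rgani.ation", "Das ordentliche Haushalts- und Finanzwesen der Organisation", True),
    ("rgani.ation", "UNESCO; twenty years of service to peace,", False),
    ("日本..文学", "日本古典文学史 / 乾安代〔ほか〕著", True),
    ("日本..文学", "日本文学史 / ドナルド・キーン著 ; 徳岡孝夫訳", False),
    ("中世.*文学", "中世日本文学と時間意識 / 永藤靖著", True),
    ("中世.*文学", "ヨーロッパ文学とラテン中世 / E.R. クルツィウス〔著〕", False),
    ("Antarctic Meteorites", "Proceedings of the fourth Symposium on Antarctic Meteorites", True),
    ("Antarctic Meteorites", "Yamato meteorites collected in Antarctica in 1969", False),
    ("US/USSR", "US/USSR Environmental Economics Symposium", True),
    ("US/USSR", "Joint US-USSR Symposium on Hypertension", False),
    (r"\(", "松本, 博(1921-)", True),
    (r"\(", "松本, 博", False),
]


def check(pattern, data):
    """Return True if the pattern matches somewhere in the field data."""
    return re.search(pattern, data) is not None


for pattern, data, expected in EXAMPLES:
    mark = "○" if check(pattern, data) else "×"
    print(mark, pattern, "|", data)
```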

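Ambiguities in regular expressions

If a regular expression could match two or more parts of the target string, it matches the one that begins earliest. If the candidates begin at the same place but differ in length, or have the same length but match in different ways, the choice is made as follows.

In general, branches are tried from left to right. For "*", "+", and "?", longer matches are tried first. Nested constructs are considered from the outermost inward, and within a sequence of pieces the leftmost piece takes precedence. The first match found in this order of preference is the one chosen.

For example, "(ab|a)b*c" could match "abc" in two ways. The first choice is between "ab" and "a"; "ab" is the left-hand branch and also leads to a successful overall match, so it is chosen. Because the "b" has already been consumed, "b*" can only match the empty string, since the earlier choice takes precedence.

In particular, when there is no "|" and only one "*", "+", or "?", the result is the longest possible match. "ab*" matches "abbbb" in "xabbbby". Note, however, that against "xabyabbbz" the pattern "ab*" matches the "ab" immediately after "x", because of the begins-earliest rule.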
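These precedence rules can be observed directly in the sketch below. Python's `re` is used as a stand-in; it resolves these particular cases in the same way. An extra pair of parentheses is added around "b*" only so that its share of the match can be inspected, and the variable names are illustrative.

```python
import re

# "(ab|a)b*c" against "abc": the left branch "ab" wins, so "b*" gets nothing.
m = re.search(r"(ab|a)(b*)c", "abc")
branch_text, star_text = m.group(1), m.group(2)
print(repr(branch_text), repr(star_text))   # 'ab' ''

# A single "*" and no "|": the longest possible match is taken.
longest = re.search("ab*", "xabbbby").group(0)
print(longest)                               # abbbb

# The earliest starting position wins, even though a longer match exists later.
earliest = re.search("ab*", "xabyabbbz")
print(earliest.group(0), earliest.start())   # ab 1
```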

History

The routine used inside the SCAN method is a wchar_t (wide-character string) version of Henry Spencer's regexp(3).

For details of Henry Spencer's regexp(3), see the following text.

This is a wchar_t version of regexp(3), modified from Henry Spencer's
V8 regexp, by A. MIYAZAWA (NACSIS).  Original permission notice is
also applied to this version.

Files are
Makefile        instructions to make everything
wregexp.3       manual page
wregexp.3.japanese      manual page in japanese
wregexp.h       header file, for /usr/include
wregexp.c       source for regcomp() and regexec()
wregsub.c       source for regsub()
wregerror.c     source for default regerror()
wregmagic.h     internal header file
try.c           source for test program
tests           test list for try and timer
testk           test list for JLE characters

------ ORIGINAL README FOLLOWS -------------------------------------------
This is a nearly-public-domain reimplementation of the V8 regexp(3) package.
It gives C programs the ability to use egrep-style regular expressions, and
does it in a much cleaner fashion than the analogous routines in SysV.

        Copyright (c) 1986 by University of Toronto.
        Written by Henry Spencer.  Not derived from licensed software.

        Permission is granted to anyone to use this software for any
        purpose on any computer system, and to redistribute it freely,
        subject to the following restrictions:

        1. The author is not responsible for the consequences of use of
                this software, no matter how awful, even if they arise
                from defects in it.

        2. The origin of this software must not be misrepresented, either
                by explicit claim or by omission.

        3. Altered versions must be plainly marked as such, and must not
                be misrepresented as being the original software.

Barring a couple of small items in the BUGS list, this implementation is
believed 100% compatible with V8.  It should even be binary-compatible,
sort of, since the only fields in a "struct regexp" that other people have
any business touching are declared in exactly the same way at the same
location in the struct (the beginning).

This implementation is *NOT* AT&T/Bell code, and is not derived from licensed
software.  Even though U of T is a V8 licensee.  This software is based on
a V8 manual page sent to me by Dennis Ritchie (the manual page enclosed
here is a complete rewrite and hence is not covered by AT&T copyright).
The software was nearly complete at the time of arrival of our V8 tape.
I haven't even looked at V8 yet, although a friend elsewhere at U of T has
been kind enough to run a few test programs using the V8 regexp(3) to resolve
a few fine points.  I admit to some familiarity with regular-expression
implementations of the past, but the only one that this code traces any
ancestry to is the one published in Kernighan & Plauger (from which this
one draws ideas but not code).

Simplistically:  put this stuff into a source directory, copy regexp.h into
/usr/include, inspect Makefile for compilation options that need changing
to suit your local environment, and then do "make r".  This compiles the
regexp(3) functions, compiles a test program, and runs a large set of
regression tests.  If there are no complaints, then put regexp.o, regsub.o,
and regerror.o into your C library, and regexp.3 into your manual-pages
directory.

Note that if you don't put regexp.h into /usr/include *before* compiling,
you'll have to add "-I." to CFLAGS before compiling.

The files are:

Makefile        instructions to make everything
regexp.3        manual page
regexp.h        header file, for /usr/include
regexp.c        source for regcomp() and regexec()
regsub.c        source for regsub()
regerror.c      source for default regerror()
regmagic.h      internal header file
try.c           source for test program
timer.c         source for timing program
tests           test list for try and timer

This implementation uses nondeterministic automata rather than the
deterministic ones found in some other implementations, which makes it
simpler, smaller, and faster at compiling regular expressions, but slower
at executing them.  In theory, anyway.  This implementation does employ
some special-case optimizations to make the simpler cases (which do make
up the bulk of regular expressions actually used) run quickly.  In general,
if you want blazing speed you're in the wrong place.  Replacing the insides
of egrep with this stuff is probably a mistake; if you want your own egrep
you're going to have to do a lot more work.  But if you want to use regular
expressions a little bit in something else, you're in luck.  Note that many
existing text editors use nondeterministic regular-expression implementations,
so you're in good company.

This stuff should be pretty portable, given appropriate option settings.
If your chars have less than 8 bits, you're going to have to change the
internal representation of the automaton, although knowledge of the details
of this is fairly localized.  There are no "reserved" char values except for
NUL, and no special significance is attached to the top bit of chars.
The string(3) functions are used a fair bit, on the grounds that they are
probably faster than coding the operations in line.  Some attempts at code
tuning have been made, but this is invariably a bit machine-specific.