himitsu

Registered since: 11 Oct 2003
Location: Elbflorenz
44,071 posts
 
Delphi 12 Athens
 
#3

Re: ePCRE (himi's TRegEx)

  13 Apr 2010, 10:26
So, now that I have more time again and, above all, finally know where that nasty bug I simply could not find was hiding,
> faulty reference counting? (record + dyn. array)
things can move on.

Lovely, having to rework/patch a good handful of usages in the code because Delphi simply does not work properly.
(This project is more important right now, but I already dread having to check himXML for similar "traps", since some of these records are used there as well.)


Quote from himitsu:
Implementing this is not exactly easy.
For now I have commented out the code that already existed for it and will not pursue it any further at the moment (unless someone turns up who needs it).

And once more on Unicode:
This class will be designed for Unicode only, but it has a converter for emergencies.
Delphi source code:
Class Function Convert(Const Expr: RawByteString; SourceEncoding: TEncoding = nil): UnicodeString; Overload;
Class Function Convert(Const Expr: UnicodeString; DestEncoding: TEncoding = nil): RawByteString; Overload;
Once it runs, a separate single-byte version will be derived from it, and multi-byte character sets such as UTF-8 will be redirected to the Unicode version.
And all of it will only be available for Delphi 2009 or higher.
(an alternative adaptation down to D2006/TDE is still open, and going any lower will not be possible)
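
A minimal sketch of what such a converter could look like, based only on the two declarations above. The function names BytesToUnicode and UnicodeToBytes are hypothetical stand-ins (this is not the ePCRE code), and treating a nil encoding as TEncoding.Default is an assumption:

```delphi
program ConvertSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical stand-in for Convert(RawByteString): raw bytes -> UnicodeString.
function BytesToUnicode(const Expr: RawByteString;
  SourceEncoding: TEncoding = nil): UnicodeString;
var
  Bytes: TBytes;
begin
  if Expr = '' then
    Exit('');
  if SourceEncoding = nil then
    SourceEncoding := TEncoding.Default;  // assumption: nil means the system ANSI code page
  SetLength(Bytes, Length(Expr));
  Move(Expr[1], Bytes[0], Length(Expr));  // copy the bytes without any code page conversion
  Result := SourceEncoding.GetString(Bytes);
end;

// Hypothetical stand-in for Convert(UnicodeString): UnicodeString -> raw bytes.
function UnicodeToBytes(const Expr: UnicodeString;
  DestEncoding: TEncoding = nil): RawByteString;
var
  Bytes: TBytes;
begin
  Result := '';
  if Expr = '' then
    Exit;
  if DestEncoding = nil then
    DestEncoding := TEncoding.Default;    // same assumption as above
  Bytes := DestEncoding.GetBytes(Expr);
  SetLength(Result, Length(Bytes));
  if Length(Bytes) > 0 then
    Move(Bytes[0], Result[1], Length(Bytes));
  // Note: the code page tag of Result is not adjusted in this sketch.
end;

var
  Raw: RawByteString;
  Back: UnicodeString;
begin
  // 'a|' followed by U+00E4, round-tripped through UTF-8
  Raw := UnicodeToBytes('a|'#$00E4, TEncoding.UTF8);
  Back := BytesToUnicode(Raw, TEncoding.UTF8);
  Writeln(Length(Raw));
  Writeln(Back = 'a|'#$00E4);
end.
```

The bytes are copied with Move on purpose, so that no implicit Ansi/Unicode conversion by the compiler gets in the way. The sketch assumes a desktop compiler with 1-based strings.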


The current content of my RegExp definition (line ends are cut off a little ... for the rest see RegEx.txt up there)
As I said, if anyone spots errors or possible improvements ... please report them early.
Code:
description
   『patt』             pattern
   「patt」             alternative
   〔patt 1║patt 2║…〕  alternative group
   【name】             see description "name"
   〈…〉                -

expression
   『【delimiter】【pattern】【delimiter】【modifiers】』

delimiter
   A delimiter can be any non-alphanumeric, non-whitespace character, but ...
   Often used delimiters are forward slashes (/), hash signs (#) and tilde...

   The delimiters in order of their statistical use: /#~!@%°=&

modifiers
   『「【set】」「-【reset】」』

   Values for 【set】 and 【reset】 are groups of the following characters:
   i         remCaseLess       Do case-insensitive pattern matching.
   m         remMultiLine      Treat string as multiple lines. That is, ...
   A   (2)  remAnchored       *
   D   (2)  remDollarEndOnly  *(ignored if modifier "m" is set)
   s         remSingleLine     Treat string as single line. That is, cha...
   S   (1)                     *speed up execution
   U         remUngreedy       *suppress greediness
   x         remExtended       Extend your pattern's legibility by permi...
   u   (1)                     *interpreted as UTF-8
   p   (1)  (preserve)        Preserve the string matched such that ${^...
   g   (1)  (global)          Global matching

   1)  not supported
   2)  not allowed as pattern in extended groups

pattern syntax - meta-characters:
   『\…』    general escape character with several uses
   『(…)』   subpattern
   『…|…』   alternative patterns
   『.』     match any character except newline (by default)
   『^』     assert start of subject (or line, in multiline mode)
   『$』     assert end of subject (or line, in multiline mode)
   『[…]』   character class
   『…?』    0 or 1 quantifier (or quantifier minimizer)
   『…*』    0 or more quantifier
   『…+』    1 or more quantifier
   『…{…}』  min/max quantifier
   『#…』    comment - only if modifier "x" is set

   If these characters are to be used literally, they must be delimited (escaped).

meta-characters in character classes:
   『\…』    general escape character
   『^』     negate the class, but only if it is the first character
   『-』     indicates character range
   『[:…:]』 POSIX character class

delimited characters and classes
   \0        null or octal character code
   \1 to \9  back reference
   \a        bell (alert)
   \A        text start
   \b \B     word boundary
   \c        control character
   \C        single character
   \d \D     decimal digit
   \e        escape
   \E        end of quote (\Q, \L and \U)
   \f        form feed
   \g        back reference
   \G        matches start
   \h \H     horizontal space characters
   \k        named back reference
   \K        keep the left stuff
   \l \L     lowercase characters
   \n        new line
   \N        named unicode character
   \p \P     named property
   \Q        quote
   \r        carriage return
   \R        newline sequence
   \s \S     space
   \t        tabulator
   \u \U     uppercase characters
   \v \V     vertical space characters
   \w \W     word characters
   \x        heXadecimal character code
   \X        eXtended unicode sequence
   \z        text end
   \Z        text end or end of last line
   \<        start of word
   \>        end of word

   The following characters must be delimited (escaped) if they are to be used literally.
      \ ( ) | . ^ $ [ ? * + {
      #   (if modifier "x" is set)

characters
   『\0【digit】』               octal character code
   『\x【x-digit】【x-digit】』  heXadecimal character code (Ansi)
   『\x{【x-digits】}』          heXadecimal character code (Unicode)
   『\c【character】』           control char
   『\N{【name】}』              named unicode character

   supported names
      U+xxxx                     hexadecimal character code

named character class (named unicode properties)
   『\p【character】』  for names of only one letter
   『\p{【name】}』
   『\P【character】』  any character without this property
   『\P{【name】}』     any character without this property

   supported classes
      IsCntrl, IsSpace, IsSpacePerl, IsDigit, IsXDigit, IsUpper, IsLower,
      IsAlpha, IsAlnum, IsWord, IsPunct, IsGraph, IsPrint, IsASCII

   supported scripts
      Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Bu...
      Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, Cypriot, C...
      Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, ...
      Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada...
      Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam, Mongolian...
      New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Ph...
      Phoenician, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, ...
      Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi

   supported general category property codes
      C    other
      Cc   control
      Cf   format
      Cn   unassigned
      Co   private use
      Cs   surrogate

      L    letter
      Ll   lower case letter - specifying caseless matching does not affe...
      Lm   modifier letter
      Lo   other letter
      Lt   title case letter - specifying caseless matching does not affe...
      Lu   upper case letter - specifying caseless matching does not affe...

      M    mark
      Mc   spacing mark
      Me   enclosing mark
      Mn   non-spacing mark

      N    number
      Nd   decimal number
      Nl   letter number
      No   other number

      P    punctuation
      Pc   connector punctuation
      Pd   dash punctuation
      Pe   close punctuation
      Pf   final punctuation
      Pi   initial punctuation
      Po   other punctuation
      Ps   open punctuation

      S    symbol
      Sc   currency symbol
      Sk   modifier symbol
      Sm   mathematical symbol
      So   other symbol

      Z    separator
      Zl   line separator
      Zp   paragraph separator
      Zs   space separator

character class
   『[「^」【character list】「【character list】…」]』

   character list
      『【character】』                single character or delimited char...
      『【character】-【character】』  range of characters
      『\【class】』                   delimited class
      『[:【POSIX】:]』                POSIX character class

   ^   inverts the class

POSIX character class
   『[「^」:【name】:]』:

   this can be used only inside a character class ( […] )

   supported classes
      cntrl, space, blank, digit, xdigit, upper, lower,
      alpha, alnum, punct, graph, print

group
   『(【pattern】)』

named group
   『(?「P」<【name】>【pattern】)』

modifier change (extended group)
   『(?【modifiers】)』

extended group
   『(?「【modifiers】」:【pattern】)』

look-ahead
   『(?「【modifiers】」=【pattern】)』

negative look-ahead
   『(?「【modifiers】」!【pattern】)』

look-behind
   『(?「【modifiers】」<=【pattern】)』

negative look-behind
   『(?「【modifiers】」<!【pattern】)』

recursive subpattern
   『(?「-║+」【number】)』
   『(?R)』
   『(?P>【name】)』
   『(?P&【name】)』

   clones the pattern (not the result) of a previous group

   (?R) = (?0)

conditional subpattern
   『(?(【condition】)【yes-pattern】「|【no-pattern】」)』

   condition
      『「-║+」【number】』
      『R』
      『{【name】}』
      『【pattern】』

back references
   『\【digit】』             for the references 1 to 9
   『\g【digit】』
   『\g{「-║+」【number】}』
   『\g【character】』        for names of only one letter
   『\g{【name】}』

named back references
   『\k<【name】>』
   『\k'【name】'』
   『\k{【name】}』

comments
   『(?#【text】)』
   『#【text】([\r\n]|$)』  (1)

   not in character sets

   1)  only if modifier "x" is set

quantifier
   『【pattern】?「?║+」』      once or not at all      equivalent to ...
   『【pattern】*「?║+」』      zero or more times      equivalent to ...
   『【pattern】+「?║+」』      at least once           equivalent to ...
   『【pattern】{n}「?║+」』    n times
   『【pattern】{n,}「?║+」』   at least n times
   『【pattern】{n,m}「?║+」』  n to m times

characters and character classes:
   .   any character - if multiple lines are not activated then doesn't m...
   \0  null character
   \a  bell (alert, #7)
   \n  new line (#10)
   \f  form feed (#12)
   \e  escape (#27)
   \t  tabulator (#9)
   \h  horizontal space characters
   \v  vertical space characters
   \r  carriage return (#13)
   \R  newline sequence
   \d  decimal digit
   \w  word character
   \s  space
   \X  eXtended unicode sequence
   \C  single char - one character or one half of a surrogate pair

   \H  any character but a horizontal space character
   \V  any character but a vertical space character
   \D  any character but a decimal digit
   \W  any character but a word character
   \S  any character but a space

control classes:
   ^   line start
   $   line end
   \A  text start
   \G  matches start
   \z  text end
   \Z  text end or end of last line
   \b  word boundary
   \B  not a word boundary
   \<  start of word
   \>  end of word
   \l  lowercase next char
   \u  uppercase next char
   \L  lowercase till \E
   \U  uppercase till \E
   \Q  quote (disable) pattern metacharacters till \E
   \E  end of quote (\Q, \L and \U)
   \K  keep the stuff left of the \K, don't include it in result

options
   reoSplitNoEmpty                  If this flag is set, then from SPLIT ...
   reoSplitDelimCapture             If this flag is set, then be parenthe...
   reoOffsetCapture                 If this flag is set, then returned wi...
   reoSplitSetCapture               Orders results so that $array[0] an a...
   default (no reoSplitSetCapture)  Orders results so that $array[0] an a...
   reoCustomizeLinebreaks

related character classes and sets

   DESCRIPTION      POSIX        PERL FN          PERL PERL           ...
                                                                          ...
   ---------------  -----------  ---------------  --   ----------------...
   any char                                       .    [^\n\r]
   control          [:cntrl:]    \p{IsCntrl}           [\x00-\x1F\x7F] ...
   white space+tab  [:blank:]    \p{IsSpace}           [ \t]           ...
   whitespace                    \p{IsSpace}      \s   [ \f\t\v]
   whitespacePerl   [:space:]    \p{IsSpacePerl}       [ \f\n\r\t\v]   ...
   punctuation      [:punct:]    \p{IsPunct}           [!-/:-@[-`{-~]  ...
   decimal digit    [:digit:]    \p{IsDigit}      \d   [0-9]           ...
   hexadecimal      [:xdigit:]   \p{IsXDigit}          [0-9A-Fa-f]     ...
   upper            [:upper:]    \p{IsUpper}      \u   [A-Z]           ...
   lower            [:lower:]    \p{IsLower}      \l   [a-z]           ...
   upper+lower      [:alpha:]    \p{IsAlpha}           [A-Za-z]        ...
   alphanumeric     [:alnum:]    \p{IsAlnum}           [A-Za-z0-9]     ...
   alphanumeric+_   [:word:]     \p{IsWord}       \w   [A-Za-z0-9_]    ...
   printable        [:graph:]    \p{IsGraph}           [!-~]           ...
   printable+space  [:print:]    \p{IsPrint}           [ -~]           ...
   any ASCII        [:ascii:]    \p{IsASCII}           [\x00-\xFF]     ...
   any Unicode                                         [\x00-\x{FFFF}]



   [:punct:]   []!"#$%&\'()*+,./:;<=>?@\\^_`{|}~[-]
   [:xdigit:]  [[:digit:]A-Fa-f]
   [:alpha:]   [[:upper:][:lower:]]
   [:alnum:]   [[:alpha:][:digit:]]
   [:word:]    [[:alnum:]_]
   [:graph:]   [[:word:][:punct:]]
   [:print:]   [ [:graph:]]
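
To make the "expression" format from the definition above a bit more tangible (delimiter, pattern, delimiter, modifiers), here is a rough sketch of how such a string could be split. It does not use the ePCRE class. As an assumption for illustration it maps the modifiers i, m, s and x onto the options of Delphi's built-in TRegEx from System.RegularExpressions (XE2 or newer). The helper name SplitExpression is hypothetical, and the "-reset" part as well as all other modifiers are simply rejected:

```delphi
program DelimiterSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: splits '<delim>pattern<delim>modifiers' and maps the
// modifiers i, m, s and x onto the options of Delphi's built-in TRegEx.
function SplitExpression(const Expr: string; out Pattern: string;
  out Options: TRegExOptions): Boolean;
var
  Delim: Char;
  Last, I: Integer;
begin
  Result := False;
  Pattern := '';
  Options := [];
  if Length(Expr) < 2 then
    Exit;
  Delim := Expr[1];
  // a delimiter must be non-alphanumeric and non-whitespace (ASCII check only)
  if CharInSet(Delim, ['0'..'9', 'A'..'Z', 'a'..'z', ' ', #9, #10, #13]) then
    Exit;
  // search the closing delimiter from the end of the expression
  Last := Length(Expr);
  while (Last > 1) and (Expr[Last] <> Delim) do
    Dec(Last);
  if Last = 1 then
    Exit;  // closing delimiter is missing
  Pattern := Copy(Expr, 2, Last - 2);
  for I := Last + 1 to Length(Expr) do
    case Expr[I] of
      'i': Include(Options, roIgnoreCase);
      'm': Include(Options, roMultiLine);
      's': Include(Options, roSingleLine);
      'x': Include(Options, roIgnorePatternSpace);
    else
      Exit;  // unsupported modifier in this sketch, Result stays False
    end;
  Result := True;
end;

var
  Pattern: string;
  Options: TRegExOptions;
begin
  if SplitExpression('/^h[ae]llo$/im', Pattern, Options) then
  begin
    Writeln(Pattern);
    Writeln(TRegEx.IsMatch('Say'#10'HELLO', Pattern, Options));
  end;
end.
```

Searching the closing delimiter from the end keeps the sketch short, because the delimiter may then appear inside the pattern without any special handling. A real implementation would also have to deal with the set/reset syntax described under "modifiers".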
Latest insight:
Since Pos has had a third parameter,
PoSex is practiced far less often in Delphi.