ParsingToolkit.h
1 /**
2  * @file icaruscode/PMT/Algorithms/ParsingToolkit.h
3  * @brief Simple text parsing utilities.
4  * @author Gianluca Petrillo (petrillo@slac.stanford.edu)
5  * @date May 13, 2022
6  * @see icaruscode/PMT/Algorithms/ParsingToolkit.cxx
7  */
8 
9 #ifndef ICARUSCODE_PMT_ALGORITHMS_PARSINGTOOLKIT_H
10 #define ICARUSCODE_PMT_ALGORITHMS_PARSINGTOOLKIT_H
11 
12 // C/C++ standard libraries
13 #include <algorithm> // std::count_if()
14 #include <istream>
15 #include <stdexcept> // std::runtime_error
16 #include <vector>
17 #include <initializer_list>
18 #include <string>
19 #include <string_view>
20 #include <utility> // std::pair, std::move()
21 #include <cctype> // std::isblank()
22 #include <cassert>
23 
24 
25 // -----------------------------------------------------------------------------
26 namespace icarus { struct ParsingToolkit; }
27 /**
28  * @brief Utilities for text parsing.
29  *
30  * This "class" is a glorified namespace with some configuration inside.
31  *
32  *
33  * Quotation
34  * ----------
35  *
36  * A quoted string is the content in between an opening quoting sequence and
37  * the matching closing sequence. Each sequence may be any string, including
38  * but not limited to a one-character long string. Escaping the first character
39  * of an opening or closing quotation string turns it into plain string data
40  * carrying no quotation meaning.
41  *
42  *
43  * Escaping rules
44  * ---------------
45  *
46  * Any single character following the escape character is "escaped".
47  * The escaped characters lose their standard function and are replaced by a
48  * substitute character. For example, escaping the first character of an opening
49  * quotation makes that a standard character. An escaped escape character is
50  * always replaced by the character itself, without its escape function.
51  *
52  *
53  */
54 struct icarus::ParsingToolkit {
55 
56  /// Base type for errors in the toolkit.
57  struct Error;
58 
59  /// Record of a split token: pre-separator, separator and post-separator.
60  struct SplitView_t { std::string_view pre, sep, post; };
61 
62  /// Specification of quotation: opening and closing.
63  using QuotSpec_t = std::pair<std::string, std::string>;
64 
65  /// All parsing parameters.
66  struct Params_t {
67 
68 
69  char escape { '\\' }; ///< Escape character.
70 
71  std::string comment { "#" }; ///< Word introducing a comment.
72 
73  char EOL { '\n' }; ///< End-of-line character.
74 
75  /// List of matching start and end of quote.
76  std::vector<QuotSpec_t> quotes {
77  QuotSpec_t{ R"(")", R"(")" },
78  QuotSpec_t{ R"(')", R"(')" }
79  };
80 
81  }; // Params_t
82 
83  // Adapter converting argument of functions like `std::isblank()` properly.
84  template <int (*CCTF)(int)>
85  struct CCTypeAdapter {
86  template <typename Ch>
87  constexpr bool operator() (Ch c) const noexcept
88  { return CCTF(static_cast<unsigned char>(c)); }
89  }; // CCTypeAdapter
90 
91  /// Adapter for determining if a character is a blank (see `std::isblank()`).
92  static constexpr CCTypeAdapter<&std::isblank> isBlank {};
93 
94 
95  // --- BEGIN --- Initialization ----------------------------------------------
96 
97  static Params_t const DefaultParameters; ///< Default parsing parameters.
98 
99  /// Creates a parser with the default parsing parameters.
100  ParsingToolkit(): ParsingToolkit{ DefaultParameters } {}
101 
102  /// Creates a parser with the specified parsing parameters.
103  ParsingToolkit(Params_t params) { adoptParams(std::move(params)); }
104 
105  // --- END ----- Initialization ----------------------------------------------
106 
107 
108  // --- BEGIN --- Query -------------------------------------------------------
109 
110  /// Returns the current parameters of parsing.
111  Params_t const& params() const noexcept { return fParams; }
112 
113  // --- END ----- Query -------------------------------------------------------
114 
115 
116  // --- BEGIN --- Input -------------------------------------------------------
117  /// @name Input
118  /// @{
119 
120  /**
121  * @brief Returns a single line of text from the input stream.
122  * @param in the input stream
123  * @return the string read, and the number of lines read
124  * @throw Error on fatal parsing errors
125  *
126  * This function reads entire lines from `in`, where a line is defined as
127  * in `std::getline()`. If the line ends with an unescaped escape character,
128  * another line is read and appended (the escape character is dropped).
129  * The return value is the merged string with no end-of-line characters,
130  * and the number of lines read.
131  * If there is no string to be read, it returns an empty string and `0U`.
132  *
133  * ### Special behaviour
134  *
135  * * If the line ends while a quotation is still open, the next line is also
136  * merged, and the line break is kept; to merge quoted lines without
137  * preserving the line break character, end the quote on the first line,
138  * immediately break the line escaping it, and the next line should
139  * immediately start by opening a quotation.
140  * * If the line ends while a quotation is still open, it is a parsing error
141  * to have the line break character escaped, and an `Error` exception is
142  * thrown.
143  * * If the file ends while a quotation is still open, the line is preserved
144  * as such.
145  */
146  std::pair<std::string, unsigned int> readMultiline(std::istream& in) const;
147 
148  /// @}
149  // --- END ----- Input -------------------------------------------------------
150 
151  // --- BEGIN --- Tokenization ------------------------------------------------
152  /// @name Tokenization
153  /// @{
154  /**
155  * @brief Splits a string into words.
156  * @tparam Delim type of delimiter functor
157  * @param s the string to be split
158  * @param isDelimiter (default: `isblank()`) determines if a character is a
159  * word delimiter
160  * @return a sequence of views, one per word
161  *
162  * The splitter algorithm defines a word separator as a sequence of one or
163  * more unescaped, unquoted delimiter characters, where a delimiter is a
164  * character `ch` for which `isDelimiter(ch)` is `true`.
165  *
166  * Note that this function does not change the content of the data, and in
167  * particular it does not remove escaping nor quoting (although it interprets
168  * both).
169  *
170  * A character used as delimiter can appear in a word only if escaped or
171  * within quotation. Contiguous non-delimiter elements of a string, including
172  * quoted strings, belong to the same word (for example, `a" and "b` is a
173  * single word when delimitation is by blank characters).
174  * An empty word can be introduced only in quotations (e.g. `""`).
175  *
176  * The `Delim` type is a functor so that `isDelimiter(ch)` returns something
177  * convertible to `bool`, `true` if the `ch` character should be considered
178  * a delimiter. Note that no context is provided for the answer, so the
179  * use of each character as delimiter is fixed, and modified only by the
180  * hard-coded quotation and escaping rules.
181  *
182  * The first characters of quotation starts and the escape characters must not
183  * be classified as delimiters, or the algorithm will give wrong results.
184  */
185  template <typename Delim>
186  std::vector<std::string_view> splitWords
187  (std::string const& s, Delim isDelimiter) const;
188 
189  /// Helper version of `splitWords(std::string const&, Delim)`.
190  std::vector<std::string_view> splitWords(std::string const& s) const
191  { return splitWords(s, isBlank); }
192 
193 
194  /**
195  * @brief Finds the first word starting with a comment marker.
196  * @tparam Iter type of iterator to the words
197  * @param beginWord iterator to the first word to consider
198  * @param endWord iterator past the last word to consider
199  * @return an iterator to the comment word, or `endWord` if not found
200  *
201  * The list of words is not modified; to remove the comment word and all the
202  * following ones, see `removeCommentLine()`.
203  */
204  template <typename Iter>
205  Iter findCommentWord(Iter beginWord, Iter endWord) const;
206 
207 
208  /**
209  * @brief Removes all the words from the one starting with a comment marker.
210  * @param words list of words
211  *
212  * The original list is modified, the word starting with a comment marker and
213  * all the following ones are removed.
214  */
215  template <typename WordType>
216  void removeCommentLine(std::vector<WordType>& words) const
217  { words.erase(findCommentWord(words.begin(), words.end()), words.end()); }
218 
219 
220  /**
221  * @brief Finds the start of the next quotation in `sv`.
222  * @param sv the buffer in which to look for the quotation start
223  * @return a subview of `sv` starting from the quotation found (empty if none) and a pointer to its `QuotSpec_t` (`nullptr` if none)
224  */
225  std::pair<std::string_view, QuotSpec_t const*> findQuotationStart
226  (std::string_view sv) const;
227 
228  /**
229  * @brief Finds the quotation end in `sv`.
230  * @param sv the buffer in which to look for the quotation end
231  * @param quotEnd the quotation end to be searched
232  * @return a view of `sv` from the quotation end, included, empty if not found
233  *
234  * Note that `sv` should not include the quotation start.
235  */
236  std::string_view findQuotationEnd
237  (std::string_view sv, std::string const& quotEnd) const;
238 
239  /// Returns whether the sequence `sv` has an unclosed quotation at its end.
240  bool isQuotationUnclosed(std::string_view sv) const;
241 
242  /**
243  * @brief Finds the first of the specified keys in `sv`.
244  * @tparam BIter type of iterator to the keys
245  * @tparam EIter type of key end-iterator
246  * @param sv string to be parsed
247  * @param beginKey iterator to the first key
248  * @param endKey iterator past the last key
249  * @return a view of the key found within `sv`, empty if none
250  *
251  * The `keys` are required to be sorted, longest first, since they are tested
252  * in order and the first match is kept (e.g. if the first key is `=` and the
253  * second is `==`, the second key is never matched since the first one matches
254  * first).
255  * The first character of the key must not be escaped. Escaped characters in
256  * the key are not supported.
257  *
258  * If no `key` is found, the returned view is zero-length and pointing to the
259  * end of `sv`.
260  *
261  * The quoting in `sv` is ignored.
262  */
263  template <typename BIter, typename EIter>
264  std::string_view findFirstUnescaped
265  (std::string_view sv, BIter beginKey, EIter endKey) const;
266 
267  // @{
268  /**
269  * @brief Finds the first of the specified keys in `sv`.
270  * @tparam Keys type of the collection of keys
271  * @tparam Key type of key (for the initializer list overload)
272  * @param sv string to be parsed
273  * @param keys the collection of keys to look for, in order of priority
274  * (longest first, see below)
275  * @return a view of the key found within `sv`, empty if none
276  *
277  * The `keys` are required to be sorted, longest first, since they are tested
278  * in order and the first match is kept (e.g. if the first key is `=` and the
279  * second is `==`, the second key is never matched since the first one matches
280  * first).
281  * The first character of the key must not be escaped. Escaped characters in
282  * the key are not supported.
283  *
284  * If no `key` is found, the returned view is zero-length and pointing to the
285  * end of `sv`.
286  *
287  * The quoting in `sv` is ignored.
288  */
289  template <typename Keys>
290  std::string_view findFirstUnescaped
291  (std::string_view sv, Keys const& keys) const;
292 
293  template <typename Key>
294  std::string_view findFirstUnescaped
295  (std::string_view sv, std::initializer_list<Key> keys) const;
296 
297  //@}
298 
299 
300  /**
301  * @brief Finds the first of the specified keys in the unquoted part of `sv`.
302  * @tparam BIter type of iterator to the keys
303  * @tparam EIter type of key end-iterator
304  * @param sv string to be parsed
305  * @param beginKey iterator to the first key
306  * @param endKey iterator past the last key
307  * @return the view pointing to the key in `sv`, or empty to its end if none
308  *
309  * The `keys` are required to be sorted, longest first, since they are tested
310  * in order and the first match is kept (e.g. if the first key is `=` and the
311  * second is `==`, the second key is never matched since the first one matches
312  * first).
313  *
314  * If no `key` is found, the returned view is zero-length and pointing to the
315  * end of `sv`.
316  */
317  template <typename BIter, typename EIter>
318  std::string_view findFirstUnquoted
319  (std::string_view sv, BIter beginKey, EIter endKey) const;
320 
321  // @{
322  /**
323  * @brief Finds the first of the specified keys in the unquoted part of `sv`.
324  * @tparam Keys type of the collection of keys
325  * @tparam Key type of key (for the initializer list overload)
326  * @param sv string to be parsed
327  * @param keys the collection of keys to look for, in order of priority
328  * (longest first, see below)
329  * @return the view pointing to the key in `sv`, or empty to its end if none
330  *
331  * The `keys` are required to be sorted, longest first, since they are tested
332  * in order and the first match is kept (e.g. if the first key is `=` and the
333  * second is `==`, the second key is never matched since the first one matches
334  * first).
335  *
336  * If no `key` is found, the returned view is zero-length and pointing to the
337  * end of `sv`.
338  */
339  template <typename Keys>
340  std::string_view findFirstUnquoted
341  (std::string_view sv, Keys const& keys) const;
342 
343  template <typename Key>
344  std::string_view findFirstUnquoted
345  (std::string_view sv, std::initializer_list<Key> keys) const;
346 
347  //@}
348 
349 
350  /**
351  * @brief Splits the view `sv` in three: before `sep`, `sep` and after `sep`.
352  * @param sv view of the string to split
353  * @param sep a subview of `sv` to split at
354  * @return a `SplitView_t` object with the three parts split, empty if needed
355  *
356  * The view `sep` is required to be a subview of `sv`: it's not enough for it
357  * to have as content a substring of `sv`. For example, `splitOn("a:1", ":")`
358  * will not work, because the string `"a:1"` does not share data in memory
359  * with `":"`.
360  *
361  * Even if `sep` is empty, it's still required to point with both `begin()`
362  * and `end()` within `sv`, and `sv` will be split according to that point.
363  */
364  static SplitView_t splitOn(std::string_view sv, std::string_view sep);
365 
366  /// @}
367  // --- END ----- Tokenization ------------------------------------------------
368 
369 
370  // --- BEGIN --- Characters --------------------------------------------------
371  /// @name Characters
372  /// @{
373 
374  /// Returns whether `ch` is an escape character.
375  bool isEscape(char ch) const { return ch == fParams.escape; }
376 
377  /**
378  * @brief Returns whether the character pointed by `itCh` is escaped or not.
379  * @tparam BIter iterator type
380  * @param begin iterator to the beginning of the string
381  * @param itCh iterator to the character to be investigated.
382  * @return whether there is an unescaped escape character before `itCh`
383  *
384  * Note that `itCh` may be an end iterator (for an empty string, the result
385  * is `false`).
386  */
387  template <typename BIter>
388  bool isCharacterEscaped(BIter begin, BIter itCh) const;
389 
390  /**
391  * @brief Finds the next character satisfying the specified criterion.
392  * @tparam Sel type of functor determining which character to consider blank
393  * @param s view of the string to be parsed
394  * @param select functor determining which character(s) to look for
395  * @return an iterator to the first character, `s.end()` if none
396  *
397  * By default, the selected character is a blank character `ch`, which has
398  * `std::isblank(ch)` `true`.
399  */
400  template <typename Sel>
401  std::string_view::const_iterator findNextCharacter
402  (std::string_view s, Sel select) const;
403 
404  /// Helper function for `findNextCharacter(std::string_view, Sel)`.
405  std::string_view::const_iterator findNextBlank(std::string_view s) const
406  { return findNextCharacter(s, isBlank); }
407 
408  /**
409  * @brief Consumes the characters of the specified type at the beginning of `s`.
410  * @tparam CType type of functor determining which type of character to remove
411  * @param s view of the string to be parsed
412  * @param charType functor determining which characters to remove
413  * @return a view of `s` starting after its leading `charType` characters
414  * @see `removeTrailingBlanks()`
415  */
416  template <typename CType>
417  std::string_view removeTrailingCharacters
418  (std::string_view s, CType charType) const;
419 
420  /// @brief Consumes the blank characters at the beginning of `s`.
421  /// @see `removeTrailingCharacters()`
422  std::string_view removeTrailingBlanks(std::string_view s) const
423  { return removeTrailingCharacters(s, isBlank); }
424 
425 
426  /**
427  * @brief Returns a copy of `w` with all escape characters removed.
428  * @param w the string to change
429  * @return a copy of `w` without escaping
430  * @see `removeEscapes(Words const&)`
431  *
432  * The escaping scheme that is applied is just to remove the escape
433  * character (no replacement table supported here).
434  * An unescaped escape character at the end of the string will not be removed.
435  *
436  * It is recommended that this be done as the last step of the parsing, since
437  * it changes the meaning of the parsing elements like quotations, comments
438  * etc.
439  *
440  * Note that applying `removeWordEscapes()` more than once will keep removing
441  * characters that in the earlier passes were not considered escapes (for
442  * example, four escape characters become two in the first pass, one in the
443  * second and disappear in the following passes).
444  *
445  */
446  std::string removeWordEscapes(std::string&& w) const;
447  std::string removeWordEscapes(std::string_view w) const
448  { return removeWordEscapes(std::string{ w }); }
449  std::string removeWordEscapes(const char* w) const
450  { return removeWordEscapes(std::string{ w }); }
451  // @}
452 
453  /**
454  * @brief Returns a copy of `words` with all escape characters removed.
455  * @tparam Words type of list of words
456  * @param words the list of words to change
457  * @return the list of words without escaping
458  * @see `removeWordEscapes(std::string&&)`
459  *
460  * The escaping is removed from each of the `words` in the list, which are
461  * treated as independent.
462  * See `removeWordEscapes(std::string&&)` for the details.
463  */
464  template <typename Words>
465  std::vector<std::string> removeEscapes(Words const& words) const;
466 
467 
468  //@{
469  /**
470  * @brief Returns a copy of `w` with no quotation starts and ends.
471  * @param w the string to change
472  * @return the word without quotations
473  * @see `removeQuotations(Words const&)`
474  *
475  * Escaping is still honored (if present).
476  *
477  * Note that applying `removeQuotations` more than once will keep removing
478  * quotation markings that in the earlier passes were not considered such (for
479  * example, `a1 << "b1 << 'c1 << " or " << c2' << b2" << a2` will become first
480  * `a1 << b1 << 'c1 << or << c2' << b2 << a2`, and eventually
481  * `a1 << b1 << c1 << or << c2 << b2 << a2`).
482  *
483  */
484  std::string removeWordQuotations(std::string&& w) const;
485  std::string removeWordQuotations(std::string_view w) const
486  { return removeWordQuotations(std::string{ w }); }
487  std::string removeWordQuotations(const char* w) const
488  { return removeWordQuotations(std::string{ w }); }
489  // @}
490 
491  /**
492  * @brief Returns a copy of `words` with no quotation starts and ends.
493  * @tparam Words type of list of words
494  * @param words the list of words to change
495  * @return the list of words without quotations
496  * @see `removeWordQuotations(std::string&&)`
497  *
498  * The substitution is applied on each of the `words` in the list, which are
499  * treated as independent.
500  * See `removeWordQuotations(std::string&&)` for the details.
501  */
502  template <typename Words>
503  std::vector<std::string> removeQuotations(Words const& words) const;
504 
505  /// @}
506  // --- END ----- Characters --------------------------------------------------
507 
508 
509  /// Creates a `std::string_view` from an entire string `s`.
510  static std::string_view make_view(std::string const& s)
511  { return make_view(s.begin(), s.end()); }
512 
513  /// Creates a `std::string_view` from two string iterators `b` and `e`.
514  template <typename BIter, typename EIter>
515  static std::string_view make_view(BIter b, EIter e)
516  { return { &*b, static_cast<std::size_t>(std::distance(b, e)) }; }
517 
518  private:
519  Params_t fParams; ///< Parsing parameters.
520 
521  // --- BEGIN -- Cache --------------------------------------------------------
522 
523  /// Start characters of all supported quotations.
524  std::string fQuoteStarts;
525 
526  // --- END ---- Cache --------------------------------------------------------
527 
528  /// Initializes the parameters and caches.
529  void adoptParams(Params_t params);
530 
531 }; // icarus::ParsingToolkit
532 
533 
534 // -----------------------------------------------------------------------------
535 struct icarus::ParsingToolkit::Error: std::runtime_error {
536  Error(std::string msg): std::runtime_error{ std::move(msg) } {}
537 };
538 
539 
540 // -----------------------------------------------------------------------------
541 // --- template implementation
542 // -----------------------------------------------------------------------------
543 template <typename Delim>
544 std::vector<std::string_view> icarus::ParsingToolkit::splitWords
545  (std::string const& s, Delim isDelimiter /* = isBlank */) const
546 {
547  // REQUIREMENT: escape character must not be classified as delimiter
548  assert(!isDelimiter(fParams.escape));
549  // REQUIREMENT: the first character of a quotation start must not be
550  // classified as delimiter
551  assert(
552  std::count_if(fQuoteStarts.cbegin(), fQuoteStarts.cend(), isDelimiter) == 0
553  );
554 
555 
556  // helper class:
557  // stores the word as collected so far, updates `sv` and starts new words
558  class WordTracker {
559  ParsingToolkit const& tk;
560  Delim const& isDelimiter;
561  std::string_view& sv;
562  std::vector<std::string_view> words;
563  std::string_view::const_iterator wStart;
564  public:
565  WordTracker(ParsingToolkit const& tk, Delim const& d, std::string_view& sv)
566  : tk{ tk }, isDelimiter{ d }, sv{ consumeDelim(sv) }, wStart{ sv.begin() }
567  {}
568  void startNew()
569  {
570  words.push_back(make_view(wStart, sv.begin()));
571  wStart = consumeDelim().begin();
572  }
573  void moveEndTo(std::string_view::const_iterator it)
574  { moveEndBy(it - sv.begin()); }
575  void moveEndBy(std::size_t n) { sv.remove_prefix(n); }
576  std::vector<std::string_view> finish()
577  { if (wStart != sv.begin()) startNew(); return std::move(words); }
578  std::string_view& consumeDelim(std::string_view& s) const
579  { return s = tk.removeTrailingCharacters(s, isDelimiter); }
580  std::string_view& consumeDelim() { return consumeDelim(sv); }
581  }; // WordTracker
582 
583  std::string_view sv = make_view(s);
584  WordTracker words { *this, isDelimiter, sv }; // shares sv management
585 
586  // sv.begin() is kept updated to the candidate end of word;
587  // the beginning of the current word is always cached as words.wStart
588  while (!sv.empty()) {
589 
590  // process up to the next quotation
591  auto const [ qsv, qptr ] = findQuotationStart(sv);
592 
593  // parse and split until the quotation start:
594  auto const qstart = qsv.begin();
595  while(true) {
596 
597  // find next space;
598  // if next space is past the quotation, stop at the quotation instead
599  words.moveEndTo
600  (findNextCharacter(make_view(sv.begin(), qstart), isDelimiter));
601 
602  if (sv.begin() == qstart) break;
603 
604  // not the quote? it's a delimiter! new word found:
605  words.startNew();
606 
607  } // while(true)
608 
609  // handle the quoted part
610  if (qptr) {
611  assert(sv.substr(0, qptr->first.length()) == qptr->first);
612 
613  words.moveEndBy(qptr->first.length());
614 
615  // find the end of the quote, and swallow it into the current word
616  std::string_view const quotEnd = findQuotationEnd(sv, qptr->second);
617  words.moveEndTo(quotEnd.begin());
618 
619  // if we have found an end of quote, swallow it too (otherwise it's over)
620  if (!quotEnd.empty()) words.moveEndBy(qptr->second.length());
621 
622  } // if quotation found
623 
624  } // while
625 
626  return words.finish();
627 
628 } // icarus::ParsingToolkit::splitWords()
629 
630 
631 // -----------------------------------------------------------------------------
632 template <typename BIter, typename EIter>
633 std::string_view icarus::ParsingToolkit::findFirstUnescaped
634  (std::string_view sv, BIter beginKey, EIter endKey) const
635 {
636 
637  typename std::iterator_traits<BIter>::value_type const* key = nullptr;
638  std::size_t keyPos = std::string_view::npos;
639 
640  for (auto iKey = beginKey; iKey != endKey; ++iKey) {
641  // find where this key is (unescaped)
642  std::size_t pos = 0;
643  while (pos < sv.length()) {
644  pos = sv.find(*iKey, pos);
645  if (pos == std::string_view::npos || !isCharacterEscaped(sv.begin(), sv.begin() + pos)) break;
646  ++pos;
647  }
648  // is this the first among the keys?
649  if (pos >= std::min(keyPos, sv.length())) continue;
650  key = &*iKey;
651  keyPos = pos;
652  } // for keys
653 
654  // return a substring of sv, not key
655  if (key) {
656  using std::begin, std::end;
657  std::size_t const keyLength = make_view(*key).length();
658  return { sv.data() + keyPos, keyLength };
659  }
660  else return { sv.data() + sv.length(), 0 };
661 } // icarus::ParsingToolkit::findFirstUnescaped()
662 
663 
664 // -----------------------------------------------------------------------------
665 template <typename Keys>
666 std::string_view icarus::ParsingToolkit::findFirstUnescaped
667  (std::string_view sv, Keys const& keys) const
668 {
669  using std::begin, std::end;
670  return findFirstUnescaped(sv, begin(keys), end(keys));
671 } // icarus::ParsingToolkit::findFirstUnescaped(Keys)
672 
673 
674 // -----------------------------------------------------------------------------
675 template <typename Key>
676 std::string_view icarus::ParsingToolkit::findFirstUnescaped
677  (std::string_view sv, std::initializer_list<Key> keys) const
678  { return findFirstUnescaped(sv, keys.begin(), keys.end()); }
679 
680 
681 // -----------------------------------------------------------------------------
682 template <typename BIter, typename EIter>
683 std::string_view icarus::ParsingToolkit::findFirstUnquoted
684  (std::string_view sv, BIter beginKey, EIter endKey) const
685 {
686 
687  // if a key is found between `b` and `e`, returns `sv` split around the key;
688  // otherwise, all `sv` is in post
689  auto findKey = [this,beginKey,endKey]
690  (std::string_view::const_iterator b, std::string_view::const_iterator e)
691  { return findFirstUnescaped(make_view(b, e), beginKey, endKey); };
692 
693  std::string_view key{ sv.data() + sv.length(), 0 };
694  while (!sv.empty()) {
695 
696  // find the next quotation
697  auto const [ fromQ, qptr ] = findQuotationStart(sv);
698 
699  // search in the unquoted part
700  key = findKey(sv.begin(), fromQ.begin());
701  if (!key.empty()) break;
702 
703  // skip the quotation; if there is no quotation, we are done
704  if (!qptr) break;
705 
706  sv = fromQ;
707  sv.remove_prefix(qptr->first.length()); // skip the quotation start
708 
709  // find the end of quotation
710  std::string_view const afterQ = findQuotationEnd(sv, qptr->second);
711 
712  if (afterQ.empty()) { // begin of quotation, but no end: no good
713  // so we don't consider this as quotation: search in the "quoted" part
714  key = findKey(fromQ.begin(), fromQ.end());
715  break;
716  } // if
717 
718  // skip the quoted material, and the quotation end too
719  sv = afterQ;
720  sv.remove_prefix(qptr->second.length());
721 
722  } // while
723 
724  return key;
725 
726 } // icarus::ParsingToolkit::findFirstUnquoted(Iter)
727 
728 
729 // -----------------------------------------------------------------------------
730 template <typename Keys>
731 std::string_view icarus::ParsingToolkit::findFirstUnquoted
732  (std::string_view sv, Keys const& keys) const
733 {
734  using std::begin, std::end;
735  return findFirstUnquoted(sv, begin(keys), end(keys));
736 } // icarus::ParsingToolkit::findFirstUnquoted(Keys)
737 
738 
739 // -----------------------------------------------------------------------------
740 template <typename Key>
741 std::string_view icarus::ParsingToolkit::findFirstUnquoted
742  (std::string_view sv, std::initializer_list<Key> keys) const
743  { return findFirstUnquoted(sv, keys.begin(), keys.end()); }
744 
745 
746 // -----------------------------------------------------------------------------
747 template <typename Iter>
748 Iter icarus::ParsingToolkit::findCommentWord(Iter beginWord, Iter endWord) const
749 {
750  for (auto it = beginWord; it != endWord; ++it) {
751  if (std::equal(fParams.comment.begin(), fParams.comment.end(), begin(*it)))
752  return it;
753  } // for
754  return endWord;
755 } // icarus::ParsingToolkit::findCommentWord()
756 
757 
758 // -----------------------------------------------------------------------------
759 template <typename Iter>
760 bool icarus::ParsingToolkit::isCharacterEscaped(Iter begin, Iter itCh) const
761 {
762  unsigned int nEscapes = 0U;
763  while (itCh-- != begin) {
764 
765  if (!isEscape(*itCh)) break;
766  ++nEscapes;
767 
768  } // while
769 
770  return (nEscapes & 1) == 1; // odd number of escapes means escaped
771 
772 } // icarus::ParsingToolkit::isCharacterEscaped()
773 
774 
775 // -----------------------------------------------------------------------------
776 template <typename Sel>
777 std::string_view::const_iterator icarus::ParsingToolkit::findNextCharacter
778  (std::string_view s, Sel selector) const
779 {
780  auto const sbegin = s.begin(), send = s.end();
781  auto it = sbegin;
782  while (it != send) {
783  it = std::find_if(it, send, selector);
784  if (!isCharacterEscaped(sbegin, it)) return it;
785  ++it; // skip the escaped character and move on
786  } // while
787  return send;
788 } // icarus::ParsingToolkit::findNextCharacter()
789 
790 
791 // -----------------------------------------------------------------------------
792 template <typename CType>
793 std::string_view icarus::ParsingToolkit::removeTrailingCharacters
794  (std::string_view s, CType charType) const
795 {
796  // REQUIREMENT: escape character must not be classified as delimiter
797  assert(!charType(fParams.escape));
798 
799  while (!s.empty()) {
800  if (!charType(s.front())) break; // escape character triggers this too
801  s.remove_prefix(1U);
802  } // while
803  return s;
804 } // icarus::ParsingToolkit::removeTrailingCharacters()
805 
806 
807 // -----------------------------------------------------------------------------
808 template <typename Words>
809 std::vector<std::string> icarus::ParsingToolkit::removeEscapes
810  (Words const& words) const
811 {
812  using std::size;
813  std::vector<std::string> nv;
814  nv.reserve(size(words));
815  for (auto const& word: words) nv.push_back(removeWordEscapes(word));
816  return nv;
817 } // icarus::ParsingToolkit::removeEscapes()
818 
819 
820 // -----------------------------------------------------------------------------
821 template <typename Words>
822 std::vector<std::string> icarus::ParsingToolkit::removeQuotations
823  (Words const& words) const
824 {
825  using std::size;
826  std::vector<std::string> nv;
827  nv.reserve(size(words));
828  for (auto const& word: words) nv.push_back(removeWordQuotations(word));
829  return nv;
830 } // icarus::ParsingToolkit::removeQuotations()
831 
832 
833 // -----------------------------------------------------------------------------
834 
835 #endif // ICARUSCODE_PMT_ALGORITHMS_PARSINGTOOLKIT_H
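
The escaping rule applied by `isCharacterEscaped()` and `findNextCharacter()` in the listing (a character counts as escaped when an odd number of escape characters immediately precedes it) can be illustrated with a standalone sketch. The free functions `isEscapedAt()` and `findUnescaped()` are hypothetical stand-ins written for this example and are not part of `icarus::ParsingToolkit`; they work on positions rather than iterators and assume a backslash as escape character, matching the default in `Params_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of ParsingToolkit): returns whether the
// character at position `pos` of `s` is escaped, that is, preceded by an odd
// number of consecutive escape characters. `pos` may be equal to `s.size()`.
bool isEscapedAt(std::string_view s, std::size_t pos, char escape = '\\') {
  assert(pos <= s.size());
  std::size_t nEscapes = 0;
  // walk backward over the run of escape characters right before `pos`
  while (pos > 0 && s[pos - 1] == escape) { ++nEscapes; --pos; }
  return (nEscapes % 2) == 1; // odd number of escapes means escaped
}

// Hypothetical helper: position of the first unescaped occurrence of `ch`
// in `s`, or `std::string_view::npos` if there is none.
std::size_t findUnescaped(std::string_view s, char ch, char escape = '\\') {
  for (std::size_t pos = s.find(ch); pos != std::string_view::npos;
       pos = s.find(ch, pos + 1))
  {
    if (!isEscapedAt(s, pos, escape)) return pos;
    // escaped occurrence: skip it and keep searching
  }
  return std::string_view::npos;
}
```

In this sketch the search stops as soon as `find()` reports no further occurrence, so the "not found" position is never used to index the string.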
std::string removeWordQuotations(const char *w) const
std::vector< std::string > removeQuotations(Words const &words) const
Returns a copy of words with no quotation starts and ends.
std::pair< std::string, unsigned int > readMultiline(std::istream &in) const
Returns a single line of text from the input stream.
std::pair< std::string_view, QuotSpec_t const * > findQuotationStart(std::string_view sv) const
Finds the start of the next quotation in sv.
std::string removeWordEscapes(std::string &&w) const
Returns a copy of w with all escape characters removed.
char escape
Escape character.
bool isQuotationUnclosed(std::string_view sv) const
Returns whether the sequence sv has an unclosed quotation at its end.
std::vector< std::string > removeEscapes(Words const &words) const
Returns a copy of words with all escape characters removed.
ParsingToolkit()
Default parsing parameters.
std::string comment
Word introducing a comment.
Iter findCommentWord(Iter beginWord, Iter endWord) const
Finds the first word starting with a comment marker.
static SplitView_t splitOn(std::string_view sv, std::string_view sep)
Splits the view sv in three: before sep, sep and after sep.
Params_t fParams
Parsing parameters.
std::vector< QuotSpec_t > quotes
List of matching start and end of quote.
std::string removeWordQuotations(std::string_view w) const
ParsingToolkit(Params_t params)
Creates a parser with the specified parsing parameters.
All parsing parameters.
std::string_view findFirstUnquoted(std::string_view sv, BIter beginKey, EIter endKey) const
Finds the first of the specified keys in the unquoted part of sv.
std::vector< std::string_view > splitWords(std::string const &s, Delim isDelimiter) const
Splits a string into words.
bool isCharacterEscaped(BIter begin, BIter itCh) const
Returns whether the character pointed to by itCh is escaped.
std::string_view findFirstUnescaped(std::string_view sv, BIter beginKey, EIter endKey) const
Finds the first of the specified keys in sv.
bool isEscape(char ch) const
Returns whether ch is an escape character.
std::string_view removeTrailingCharacters(std::string_view s, CType charType) const
Consumes the blank characters at the beginning of s.
static std::string_view make_view(std::string const &s)
Creates a std::string_view from an entire string s.
std::string removeWordEscapes(std::string_view w) const
constexpr bool operator()(Ch c) const noexcept
std::pair< std::string, std::string > QuotSpec_t
Specification of quotation: opening and closing.
std::vector< std::string_view > splitWords(std::string const &s) const
Helper version of splitWords(std::string const&, Delim).
std::string_view::const_iterator findNextCharacter(std::string_view s, Sel select) const
Finds the next character satisfying the specified criterion.
Utilities for text parsing.
void adoptParams(Params_t params)
Initializes the parameters and caches.
Record of a split token: pre-separator, separator and post-separator.
std::string_view::const_iterator findNextBlank(std::string_view s) const
Helper function for findNextCharacter(std::string_view, Sel).
static Params_t const DefaultParameters
Params_t const & params() const noexcept
Returns the current parameters of parsing.
std::string_view findQuotationEnd(std::string_view sv, std::string const &quotEnd) const
Finds the quotation end in sv.
static constexpr CCTypeAdapter<&std::isblank > isBlank
Adapter for determining if a character is a blank (see std::isblank()).
std::string_view removeTrailingBlanks(std::string_view s) const
Consumes the blank characters at the beginning of s.
static std::string_view make_view(BIter b, EIter e)
Creates a std::string_view from two string iterators b and e.
std::string fQuoteStarts
Start characters of all supported quotations.
void removeCommentLine(std::vector< WordType > &words) const
Removes all the words from the one starting with a comment marker.
std::string removeWordQuotations(std::string &&w) const
Returns a copy of w with no quotation starts and ends.
std::string removeWordEscapes(const char *w) const
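The word-by-word pattern used by removeEscapes() and removeQuotations() (reserve the output, then transform each word) can be sketched without the toolkit as follows. This is a simplified illustration under stated assumptions: `EscapeChar`, `stripEscapes()` and `stripAllEscapes()` are hypothetical names, the escape character is assumed to be a backslash, and an escaped character is kept verbatim, whereas the toolkit documentation mentions substitute characters that this sketch does not model.

```cpp
#include <cstddef>     // std::size_t
#include <iterator>    // std::size()
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the configured escape character (assumption).
constexpr char EscapeChar = '\\';

// Returns a copy of `w` where each escape character is dropped and the
// character following it is kept verbatim (no substitution is attempted).
// A lone escape character at the end of the word is left in place.
inline std::string stripEscapes(std::string_view w) {
  std::string out;
  out.reserve(w.size());
  for (std::size_t i = 0; i < w.size(); ++i) {
    if (w[i] == EscapeChar && i + 1 < w.size()) ++i; // skip escape, keep next
    out.push_back(w[i]);
  }
  return out;
}

// Applies stripEscapes() to every word of a collection.
template <typename Words>
std::vector<std::string> stripAllEscapes(Words const& words) {
  using std::size; // allows both containers and C arrays
  std::vector<std::string> result;
  result.reserve(size(words));
  for (auto const& word: words) result.push_back(stripEscapes(word));
  return result;
}
```

Called on the words `a\ b` and `c\\d`, `stripAllEscapes()` would return `a b` and `c\d`. Reserving the output size up front avoids reallocations while the words are appended.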