Whatever occurs to me at the time to discuss, on the general subjects of IT and programming.

Friday, March 7, 2008

Perl-like split for Erlang

I'm sure someone has already done this, probably many times. But I was working on a user_default module to customize my Erlang shell, with the intention of making it a good interactive shell for file and system management (more on this in the future), and I came across a couple of places where I really wanted Perl's split() instead of the typical Unix shell tool (cut).

In particular, I think Perl's split is nice, and I wanted an equivalent in Erlang. Here it is (this is a snippet rather than a module because right now it just lives in my user_default):

split(Pred, List) ->
    split(Pred, List, -1).

%% our fun split is the inverse of lists:splitwith/2 because I feel it's
%% more obvious to write: split(fun(C) -> C =:= $. end, List), given the
%% other options. If you provide a character "predicate" or a list "predicate"
%% then you are *splitting on* (not retaining) the character/sublist
%% that satisfies the condition. -pd
split2(Pred, List) when is_function(Pred) ->
    {First, Rest} = lists:splitwith(fun(C) -> not Pred(C) end, List),
    case Rest of
        [] -> [First, []];
        [_|Last] -> [First, Last]
    end;
%% Certain common cases given as special atoms
split2(Space, List) when Space == ' '; Space == space ->
    split2({rx, "\\s+"}, List);
split2(Comma, List) when Comma == ','; Comma == comma ->
    split2({rx, "\\s*,\\s*"}, List);
split2({RxSpec, RX}, List) when RxSpec == rx;
                                RxSpec == regex;
                                RxSpec == regexp ->
    tuple_to_list(splitrx(RX, List));
%% A non-empty list is a sublist separator. The empty list is excluded
%% because it would match without consuming input, and split/4 would
%% then never terminate.
split2(Pred, List) when is_list(Pred), Pred =/= [] ->
    tuple_to_list(splitsub(Pred, List, []));
%% Anything else is a single element to compare against
split2(Pred, List) ->
    split2(fun(X) -> X =:= Pred end, List).

%% We don't use regexp:split because we only want two limbs at a time
%% for counting up
splitrx(RX, List) ->
    case regexp:first_match(List, RX) of
        {match, 1, 0} ->
            throw({regex_error, zero_width_match_at_start_of_list});
        {match, Start, Len} ->
            {lists:sublist(List, Start - 1),
             lists:nthtail(Start + Len - 1, List)};
        nomatch ->
            {List, []}
    end.

%% Note--does not include substring in split, unlike lists:splitwith/2
splitsub(_, [], Acc) ->
    {lists:reverse(Acc), []};
splitsub(Str, List = [H|T], Acc) ->
    case lists:prefix(Str, List) of
        true -> {lists:reverse(Acc), lists:nthtail(length(Str), List)};
        _ -> splitsub(Str, T, [H|Acc])
    end.

split(Pred, List, N) ->
    split(Pred, List, N, []).

%% N counts down to 1; split/2 starts at -1, so it never hits the limit
split(_, [], _, Chunks) ->
    lists:reverse(Chunks);
split(_, Rest, 1, Chunks) ->
    lists:reverse([Rest|Chunks]);
split(Pred, List, N, Chunks) ->
    [First, Rest] = split2(Pred, List),
    case First of
        [] ->
            %% empty chunks are skipped and do not count against the limit
            split(Pred, Rest, N, Chunks);
        _ ->
            split(Pred, Rest, N - 1, [First|Chunks])
    end.
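
The split/2 and split/3 functions work a lot like Perl's split() (except there is no split/1 and no default variable, of course). Note that empty chunks are always dropped, so leading, trailing, and repeated separators never produce empty sublists.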

These functions split a list (usually but not necessarily a string--they're more akin to lists:splitwith/2 than string:tokens/2) on the given separator (which, in the code above, I've confusingly called Pred). The three-argument version also takes a limit--the maximum number of sublists to generate; once the limit is reached, the rest of the list is returned unsplit as the last sublist. The functions return a list of lists.
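To make the return shape and the limit concrete, here is a minimal, self-contained sketch. It is a condensed restatement of the logic above that handles only fun and single-element separators (so it has no regex dependency); the module name split_sketch and the demo/0 function are made up for illustration and are not part of the original snippet.

```erlang
-module(split_sketch).
-export([split/2, split/3, demo/0]).

%% No limit: -1 never counts down to 1.
split(Sep, List) ->
    split(Sep, List, -1).

%% A fun is used as the separator test; any other term is compared
%% against each element with =:=.
split(Sep, List, N) when is_function(Sep, 1) ->
    split(Sep, List, N, []);
split(Sep, List, N) ->
    split(fun(X) -> X =:= Sep end, List, N, []).

split(_, [], _, Chunks) ->
    lists:reverse(Chunks);
split(_, Rest, 1, Chunks) ->
    %% Limit reached: the remainder becomes the last chunk, unsplit.
    lists:reverse([Rest | Chunks]);
split(Pred, List, N, Chunks) ->
    {First, Rest0} = lists:splitwith(fun(C) -> not Pred(C) end, List),
    %% Drop the separator element itself, if there is one.
    Rest = case Rest0 of
               [] -> [];
               [_Sep | Tail] -> Tail
           end,
    case First of
        [] -> split(Pred, Rest, N, Chunks);
        _ -> split(Pred, Rest, N - 1, [First | Chunks])
    end.

demo() ->
    ["a", "b", "c"] = split($,, "a,b,c"),
    %% With a limit of 2, the second chunk keeps its commas.
    ["a", "b,c"] = split($,, "a,b,c", 2),
    %% Leading, trailing, and doubled separators yield no empty chunks.
    ["a", "b"] = split($,, ",a,,b,"),
    %% Works on any list, not just strings.
    [[1, 2], [3]] = split(fun(X) -> X =:= 0 end, [1, 2, 0, 3]),
    ok.
```

Calling split_sketch:demo() returns ok if every pattern match holds.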

The separator can be given in one of the following ways:

  • An integer, such as $, (or any other single term that is not one of the forms below)--the list will be split on each element that is exactly equal (=:=) to it.
  • A string (list)--the list will be split at each occurrence of that sublist.
  • A fun--the list will be split on each single element for which the fun returns true. Note: This is the opposite sense from lists:splitwith/2.
  • A tuple of the form {rx, RegularExpression} (regex and regexp are accepted in place of rx)--the string will be split on the substrings which match the regular expression. Care must be taken to avoid using a regular expression like " *", which makes a zero-width match at the beginning of the string; in that case the split functions throw {regex_error, zero_width_match_at_start_of_list}.
  • Certain "magical" atoms:
    • ' ' or space - Equivalent to {rx, " +"} (written as "\\s+" in the code)
    • ',' or comma - Equivalent to {rx, "\\s*,\\s*"}
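The regexp module used by splitrx/2 is not available in later OTP releases, where the re module takes its place. As a hedged sketch only, the same "first match, two limbs" step could look like this with re:run/3. The module name splitrx_re is hypothetical, the code assumes plain Latin-1/ASCII strings (the indices re returns are byte offsets), and note that in re's PCRE syntax \s means any whitespace, not just a space.

```erlang
-module(splitrx_re).
-export([splitrx/2, demo/0]).

%% Split List around the first match of RX, returning {Before, After}.
%% re:run/3 with {capture, first, index} reports a 0-based {Start, Len}.
splitrx(RX, List) ->
    case re:run(List, RX, [{capture, first, index}]) of
        {match, [{0, 0}]} ->
            %% Same guard as the original: a zero-width match at the
            %% start would never consume any input.
            throw({regex_error, zero_width_match_at_start_of_list});
        {match, [{Start, Len}]} ->
            {lists:sublist(List, Start),
             lists:nthtail(Start + Len, List)};
        nomatch ->
            {List, []}
    end.

demo() ->
    %% The separator and its surrounding whitespace are dropped.
    {"a", "b, c"} = splitrx("\\s*,\\s*", "a , b, c"),
    %% No match: the whole list is the first limb.
    {"abc", []} = splitrx(",", "abc"),
    ok.
```

Because the offsets are 0-based here, the arithmetic loses the "- 1" adjustments that the regexp:first_match/2 version needs.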

One of the useful features of these functions is that, unlike lists:splitwith/2, they do not include the separator sublist in the resulting lists.
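The difference from lists:splitwith/2 can be seen in a few lines. This is only an illustration of the standard library's behavior and of the negate-then-drop step that split2 performs; the module name splitwith_sense is made up for the example.

```erlang
-module(splitwith_sense).
-export([demo/0]).

demo() ->
    IsDot = fun(C) -> C =:= $. end,
    %% lists:splitwith/2 collects elements while the fun returns true,
    %% so passing IsDot directly stops at the very first non-dot.
    {"", "foo.bar"} = lists:splitwith(IsDot, "foo.bar"),
    %% Negating the fun collects everything up to the separator, which
    %% is still the head of the second list.
    NotDot = fun(C) -> not IsDot(C) end,
    {"foo", ".bar"} = lists:splitwith(NotDot, "foo.bar"),
    %% Dropping that head gives the separator-free pair.
    {First, [_Sep | Last]} = lists:splitwith(NotDot, "foo.bar"),
    ["foo", "bar"] = [First, Last],
    ok.
```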

I also think you could come up with a really nice set of commonly-used split patterns that you can refer to by atom.

Another thing that would be nice (as usual) is to also operate on binary strings, but that runs into encoding issues, of course.
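For what it's worth, OTP releases newer than this post ship a binary module whose binary:split/2,3 covers the simple fixed-separator case. The sketch below assumes such a release (the trim_all option in particular needs a fairly recent one) and sidesteps the encoding question by working on raw bytes; the module name binsplit_sketch is made up.

```erlang
-module(binsplit_sketch).
-export([demo/0]).

demo() ->
    %% Without options, only the first occurrence is split on, which
    %% resembles a limit of 2.
    [<<"a">>, <<"b,c">>] = binary:split(<<"a,b,c">>, <<",">>),
    %% global splits on every occurrence.
    [<<"a">>, <<"b">>, <<"c">>] =
        binary:split(<<"a,b,c">>, <<",">>, [global]),
    %% trim_all drops empty parts, matching the list version's habit
    %% of skipping empty chunks.
    [<<"a">>, <<"b">>] =
        binary:split(<<",a,,b,">>, <<",">>, [global, trim_all]),
    ok.
```

This only handles literal byte patterns; funs and regular expressions as separators would still need their own treatment.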
