LEOChunks

Includes:
<sys/types.h>
<stdint.h>
<stdbool.h>

Introduction

A chunk is a substring of a string. A chunk can be specified not only as a range of characters, but also in units that require some limited parsing of the given string, such as items, lines or words.



Functions

LEODoForEachChunk
LEOGetChunkRanges

LEODoForEachChunk


void LEODoForEachChunk(
    const char *inStr,
    size_t inBufSize,
    LEOChunkType inType, 
    bool (*inChunkCallback)(
        const char* currStr,
        size_t currLen,
        size_t currStart,
        size_t currEnd,
        void* userData ), 
    uint32_t itemDelimiter,
    void *userData );  
Parameters
inStr

A UTF8-encoded string to be split into chunks, or, for the byte chunk type (kLEOChunkTypeByte), an arbitrary buffer of bytes.

inBufSize

The number of bytes in inStr to parse.

inType

The type of unit to divide the string into. The callback is called once for each chunk of this type.

inChunkCallback

A pointer to a function that will be called for each chunk item. If this function returns false, parsing of the string for chunks is aborted. Return true to keep going.

itemDelimiter

The item delimiter to use when inType is kLEOChunkTypeItem.

userData

A pointer to an arbitrary block of data that will be passed to inChunkCallback as its userData parameter. Use this to pass in context information that your callback needs. LEODoForEachChunk() does not make any assumptions or do anything with this pointer except pass it on.

Discussion

Determine all the chunks of a certain type in a string and call the given callback for each chunk.
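
As a minimal sketch of the callback contract described above, the following function matches the documented inChunkCallback signature, counts the chunks it is handed, and returns false once a limit is reached so that iteration would stop early. The ChunkCounter struct and the CountChunkCallback name are hypothetical and exist only for this illustration. The callback deliberately ignores currStr, currLen, currStart and currEnd, because this page does not state their exact meaning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context struct, handed to the callback through userData. */
typedef struct ChunkCounter {
    size_t count;  /* number of chunks seen so far */
    size_t limit;  /* stop once this many chunks have been seen */
} ChunkCounter;

/* Same parameter list as the documented inChunkCallback. */
bool CountChunkCallback(const char *currStr, size_t currLen,
                        size_t currStart, size_t currEnd,
                        void *userData)
{
    /* Unused here: their precise semantics are not documented on this page. */
    (void)currStr; (void)currLen; (void)currStart; (void)currEnd;

    ChunkCounter *counter = userData;
    assert(counter != NULL);

    counter->count++;
    /* true keeps going, false aborts the iteration. */
    return counter->count < counter->limit;
}

/* With the library linked in, the call would presumably look like:
 *
 *     ChunkCounter counter = { 0, 10 };
 *     LEODoForEachChunk(str, strlen(str), kLEOChunkTypeItem,
 *                       CountChunkCallback, ',', &counter);
 */
```

Passing the counter through userData keeps the callback free of global state, which is the use the userData parameter is described as being for.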


LEOGetChunkRanges


void LEOGetChunkRanges(
    const char *inStr,
    LEOChunkType inType, 
    size_t inRangeStart,
    size_t inRangeEnd, 
    size_t *outChunkStart,
    size_t *outChunkEnd, 
    size_t *outDelChunkStart,
    size_t *outDelChunkEnd, 
    uint32_t itemDelimiter );  
Parameters
inStr

A UTF8-encoded string to be parsed to determine the range of the given chunk, or, for the byte chunk type, an arbitrary zero-terminated string of bytes.

inType

The type of unit you wish to specify this chunk in.

inRangeStart

The start offset of the range expressed in the unit specified by inType.

inRangeEnd

The end offset of the range expressed in the unit specified by inType.

outChunkStart

On return, this is set to a byte offset indicating the start of the payload of the given chunk, without any starting delimiters.

outChunkEnd

On return, this is set to a byte offset indicating the end of the payload of the given chunk, without any ending delimiters.

outDelChunkStart

On return, this is set to a byte offset indicating the start of the given chunk, including any starting or ending delimiters that would have to be deleted to remove this chunk completely from its string.

outDelChunkEnd

On return, this is set to a byte offset indicating the end of the given chunk, including any starting or ending delimiters that would have to be deleted to remove this chunk completely from its string.

itemDelimiter

The item delimiter to use when inType is kLEOChunkTypeItem.

Discussion

Determine what byte range corresponds to the given chunk range of inStr. You get back two offset pairs: one for extracting the chunk's value from the string, and a second for deleting the chunk, which may include one delimiter.
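
To illustrate why there are two offset pairs, here is a small sketch of what a caller might do with each. ExtractRange and DeleteRange are hypothetical helpers, not part of the library, and they assume the offsets are half-open byte ranges [start, end), which this page does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy bytes [start, end) of str into out and NUL-terminate it.
   Returns false if out is too small. Intended for the payload pair
   (outChunkStart, outChunkEnd). */
bool ExtractRange(const char *str, size_t start, size_t end,
                  char *out, size_t outSize)
{
    assert(start <= end && end <= strlen(str));
    size_t len = end - start;
    if (len + 1 > outSize)
        return false;
    memcpy(out, str + start, len);
    out[len] = '\0';
    return true;
}

/* Remove bytes [start, end) from str in place. Intended for the
   deletion pair (outDelChunkStart, outDelChunkEnd). */
void DeleteRange(char *str, size_t start, size_t end)
{
    size_t total = strlen(str);
    assert(start <= end && end <= total);
    /* + 1 so the terminating NUL moves along with the tail. */
    memmove(str + start, str + end, total - end + 1);
}
```

For the string "a,b,c" and hand-picked offsets (not actual output of LEOGetChunkRanges), extracting [2, 3) yields "b", while deleting [2, 4) also removes one adjacent comma and leaves "a,c". Deleting only the payload range would leave "a,,c" with a stray empty item, which is the situation the second pair is meant to avoid.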


Constants

gLEOChunkTypeNames

gLEOChunkTypeNames


extern const char* gLEOChunkTypeNames[kLEOChunkType_Last +1];  // Last entry is NULL. 
Discussion

String names for each chunk type.
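
Because the last entry of the array is NULL, it can be walked without knowing kLEOChunkType_Last. The sketch below shows that pattern on a stand-in array. FindNameIndex and demoNames are hypothetical, and the strings in demoNames are invented, since the actual names are not listed on this page. Whether an index into gLEOChunkTypeNames equals the corresponding LEOChunkType value is likewise an assumption the caller would need to verify.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of name in a NULL-terminated array of strings,
   or -1 if it is not present. */
int FindNameIndex(const char **names, const char *name)
{
    assert(names != NULL && name != NULL);
    for (int i = 0; names[i] != NULL; i++) {
        if (strcmp(names[i], name) == 0)
            return i;
    }
    return -1;
}

/* Stand-in with the same shape as gLEOChunkTypeNames (NULL-terminated).
   The entries are placeholders, not the library's real names. */
const char *demoNames[] = { "first", "second", "third", NULL };
```

With the library linked in, the same function could presumably be called as FindNameIndex(gLEOChunkTypeNames, someName).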


Typedefs

LEOChunkType

LEOChunkType


Constants
kLEOChunkTypeINVALID

Used in some cases to indicate something that *can* be a chunk reference is *not* a chunk.

kLEOChunkTypeByte

A single byte of the string. Extracting one may tear a byte out of the middle of a multi-byte UTF8 character and leave the result invalid as a string.

kLEOChunkTypeCharacter

UTF8-characters. One character may use several bytes, e.g. for a Chinese or Japanese character.

kLEOChunkTypeItem

Items are delimited by a certain character (by default, a comma). If there are several delimiters immediately in sequence, the items between them are considered to be empty. Items are assumed to be UTF8-strings.

kLEOChunkTypeLine

Lines are delimited by a return or a line feed. Otherwise, lines behave like items.

kLEOChunkTypeWord

Words are delimited by one or more spaces, tabs, returns or line feeds, i.e. whitespace characters. There can be no 'empty' words, and punctuation is treated just like any other non-whitespace character.

Discussion

There are different kinds of chunks that are parsed differently, depending on which of these constants you pass in.
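
The differences between the chunk types can be sketched with three small counting functions. These are illustrations of the rules described above, not the library's implementation, and all three function names are hypothetical. The sketch assumes valid UTF8 input, treats a "character" as one code point, uses a single-byte delimiter, and counts an empty buffer as one empty item. None of these edge cases are specified on this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Characters: count UTF8 code points by skipping continuation bytes,
   which always have the bit pattern 10xxxxxx. */
size_t CountUTF8Characters(const char *str, size_t byteCount)
{
    assert(str != NULL);
    size_t count = 0;
    for (size_t i = 0; i < byteCount; i++) {
        if (((unsigned char)str[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Items: every delimiter separates two items, so adjacent delimiters
   produce empty items. */
size_t CountItems(const char *str, size_t byteCount, char delimiter)
{
    assert(str != NULL);
    size_t items = 1;
    for (size_t i = 0; i < byteCount; i++) {
        if (str[i] == delimiter)
            items++;
    }
    return items;
}

/* Words: runs of whitespace collapse into one separator, so there are
   no empty words. Punctuation counts as part of a word. */
size_t CountWords(const char *str, size_t byteCount)
{
    assert(str != NULL);
    size_t words = 0;
    bool inWord = false;
    for (size_t i = 0; i < byteCount; i++) {
        char c = str[i];
        bool isSpace = (c == ' ' || c == '\t' || c == '\r' || c == '\n');
        if (!isSpace && !inWord)
            words++;  /* first byte of a new word */
        inWord = !isSpace;
    }
    return words;
}
```

For example, the 5-byte buffer consisting of "a", the three bytes E6 97 A5, and "b" holds 3 characters; "a,,b" holds 3 items with "," as the delimiter, the middle one empty; and "a  b" followed by a tab, a line feed and "c" holds 3 words.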