Parse Values From A Block Of Text Based On Specific Keys
I'm parsing some text from a source outside my control that is not in a very convenient format. I have lines like this: Problem Category: Human Endeavors Problem Subcategory: Sp
Solution 1:
If your block of text is this string:
text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
Then
import re
names = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']
text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
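# escape the names so they match literally; the capturing group keeps the keys in the split result
pat = r'({}):'.format('|'.join(re.escape(name) for name in names))
# re.split gives ['', key1, value1, key2, value2, ...]; drop the leading '' and pair up the rest
# (no flags argument: the third positional parameter of re.split is maxsplit, not flags)
data = dict(zip(*[iter(re.split(pat, text)[1:])]*2))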
print(data)
yields the dict
{'Problem Category': ' Human Endeavors ',
 'Problem Details': ' Issue with signal barrier chamber.',
 'Problem Subcategory': ' Space Exploration',
 'Problem Type': ' Failure to Launch',
 'Software Version': ' 9.8.77.omni.3'}
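The one-liner above packs several steps together. A more explicit sketch of the same idea follows; `parse_fields` is a name introduced here for illustration, and the `strip()` call is an addition that removes the leading and trailing spaces visible in the dict above. Like the original, it assumes the key names never occur inside a value.

```python
import re

def parse_fields(text, names):
    """Split `text` on the known key names and return a {key: value} dict."""
    # Escape each name so characters such as '.' or '(' are matched literally.
    pat = r'({}):'.format('|'.join(re.escape(name) for name in names))
    # Because the pattern has a capturing group, re.split keeps the matched keys:
    # [text_before_first_key, key1, value1, key2, value2, ...]
    parts = re.split(pat, text)
    keys, values = parts[1::2], parts[2::2]
    return {key: value.strip() for key, value in zip(keys, values)}

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']
text = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
        'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
        'Problem Details: Issue with signal barrier chamber.')

data = parse_fields(text, names)
print(data['Problem Type'])
```

Slicing with `[1::2]` and `[2::2]` does the same pairing as the `zip(*[iter(...)]*2)` idiom, but is easier to read.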
So you could assign
text = df_dict['NOTE_DETAILS'][0]
...
df_dict['NOTE_DETAILS'][0] = data
and then you could access the subcategories with dict indexing:
df_dict['NOTE_DETAILS'][0]['Problem Category']
Caution, though. Deeply nested dicts/DataFrames of lists of dicts are usually a bad design. As the Zen of Python says, "Flat is better than nested."
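One way to stay flat, sketched here with plain dicts rather than a DataFrame: promote the parsed keys to top-level entries of each record instead of storing a dict inside the `NOTE_DETAILS` slot. The record layout, the `NOTE_ID` field, and the `flatten_note` helper are hypothetical; only `NOTE_DETAILS` comes from the question.

```python
import re

names = ['Problem Category', 'Problem Type']
pat = r'({}):'.format('|'.join(re.escape(name) for name in names))

def flatten_note(record, field='NOTE_DETAILS'):
    """Return a copy of `record` with the parsed keys as top-level entries."""
    parts = re.split(pat, record[field])
    parsed = {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}
    # Keep every other field as-is and drop the raw text field.
    flat = {key: value for key, value in record.items() if key != field}
    flat.update(parsed)
    return flat

# Hypothetical record: one dict per note, raw text under 'NOTE_DETAILS'.
record = {
    'NOTE_ID': 1,
    'NOTE_DETAILS': 'Problem Category: Human EndeavorsProblem Type: Failure to Launch',
}
flat = flatten_note(record)
print(flat['Problem Category'])
```

A list of such flat dicts has one column per key, which is usually easier to filter and group than a column of nested dicts.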
Solution 2:
Given that you know the keywords ahead of time, partition the text into "current keyword", "remaining text", then continue to partition the remaining text with the next keyword.
# get input from somewhere
raw = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
# these are the keys, in order, without the colon, that will be captured
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']
prev_key = None
remaining = raw
out = {}
for key in keys:
    # get the value from before the key and after the key
    prev_value, _, remaining = remaining.partition(key + ':')
    # start storing values after the first iteration, since we need to
    # partition on the second key to get the first value
    if prev_key is not None:
        out[prev_key] = prev_value.strip()
    # what key to store next iteration
    prev_key = key
# capture the final value (since it lags behind the parse loop)
out[prev_key] = remaining.strip()
# out now contains the parsed values, print it out nicely
for key in keys:
    print('{}: {}'.format(key, out[key]))
This prints:
Problem Category: Human Endeavors
Problem Subcategory: Space Exploration
Problem Type: Failure to Launch
Software Version: 9.8.77.omni.3
Problem Details: Issue with signal barrier chamber.
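The loop above assumes every key is present: `str.partition` returns the whole string followed by two empty strings when the separator is missing, so a single absent key would swallow the rest of the text into the previous value. A minimal sketch of the same approach wrapped in a function that skips absent keys follows. The name `parse_in_order` and the skipping behaviour are additions, not part of the original answer.

```python
def parse_in_order(raw, keys):
    """Partition `raw` on each key in order; keys that do not occur are skipped."""
    out = {}
    prev_key = None
    remaining = raw
    for key in keys:
        marker = key + ':'
        if marker not in remaining:
            continue  # key absent: leave `remaining` untouched and move on
        # text before the marker is the value belonging to the previous key
        prev_value, _, remaining = remaining.partition(marker)
        if prev_key is not None:
            out[prev_key] = prev_value.strip()
        prev_key = key
    # the last key's value is whatever text is left over
    if prev_key is not None:
        out[prev_key] = remaining.strip()
    return out

keys = ['Problem Category', 'Problem Subcategory', 'Problem Type',
        'Software Version', 'Problem Details']
# same kind of input, but with 'Problem Type' and 'Problem Details' missing
partial = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
           'Software Version: 9.8.77.omni.3')
parsed = parse_in_order(partial, keys)
print(parsed)
```

The keys still have to appear in the listed order; only their absence is tolerated.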
Solution 3:
I hate and fear regex, so here's a solution using only built-in methods.
# splits a string using multiple delimiters.
def multi_split(s, delims):
    strings = [s]
    for delim in delims:
        strings = [x for s in strings for x in s.split(delim) if x]
    return strings
s = "Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber."
categories = ["Problem Category", "Problem Subcategory", "Problem Type", "Software Version", "Problem Details"]
headers = [category + ": " for category in categories]
details = multi_split(s, headers)
print(details)
details_dict = dict(zip(categories, details))
print(details_dict)
Result (newlines added by me for readability):
[
    'Human Endeavors ', 
    'Space Exploration', 
    'Failure to Launch', 
    '9.8.77.omni.3', 
    'Issue with signal barrier chamber.'
]
{
    'Problem Subcategory': 'Space Exploration', 
    'Problem Details': 'Issue with signal barrier chamber.', 
    'Problem Category': 'Human Endeavors ', 
    'Software Version': '9.8.77.omni.3', 
    'Problem Type': 'Failure to Launch'
}
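Because `zip(categories, details)` pairs values with keys purely by position, the approach above relies on every category being present and in the listed order. Here is a regex-free sketch that instead looks up where each header actually occurs with `str.find` and slices the text between them. `parse_by_position` is a name made up here, and it assumes each header occurs at most once and never inside a value.

```python
def parse_by_position(s, categories):
    """Locate each 'Category:' header with str.find, then slice the text between them."""
    # (position, category) for every header that actually occurs, sorted by position
    found = sorted(
        (s.find(category + ":"), category)
        for category in categories
        if category + ":" in s
    )
    result = {}
    for i, (start, category) in enumerate(found):
        value_start = start + len(category) + 1  # skip the header and its colon
        # the value runs up to the next header, or to the end of the string
        value_end = found[i + 1][0] if i + 1 < len(found) else len(s)
        result[category] = s[value_start:value_end].strip()
    return result

categories = ["Problem Category", "Problem Subcategory", "Problem Type",
              "Software Version", "Problem Details"]
# headers out of order, and two of them missing
shuffled = ("Software Version: 9.8.77.omni.3Problem Category: Human Endeavors "
            "Problem Details: Issue with signal barrier chamber.")
parsed = parse_by_position(shuffled, categories)
print(parsed)
```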
Solution 4:
That's just the job for general BNF parsing, which handles ambiguity nicely. I used Perl and Marpa, a general BNF parser. Hope this helps.
use 5.010;
use strict;
use warnings;
use Marpa::R2;
my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),
    :default ::= action => [ name, values ]
    pairs ::= pair+
    pair ::= name (' ') value
    name ::= 'Problem Category:'
    name ::= 'Problem Subcategory:'
    name ::= 'Problem Type:'
    name ::= 'Software Version:'
    name ::= 'Problem Details:'
    value ::= [\s\S]+
    :discard ~ whitespace
    whitespace ~ [\s]+
END_OF_SOURCE
} );
my $input = <<EOI;
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
EOI
my $ast = ${ $g->parse( \$input ) };
my @pairs;
ast_traverse($ast);
for my $pair (@pairs){
    my ($name, $value) = @$pair;
    say "$name = $value";
}
sub ast_traverse {
    my $ast = shift;
    if (ref $ast){
        my ($id, @children) = @$ast;
        if ($id eq 'pair'){
            my ($name, $value) = @children;
            chop $name->[1];
            shift @$value;
            $value = join('', @$value);
            chomp $value;
            push @pairs, [ $name->[1], '"' . $value . '"' ];
        }
        else {
            ast_traverse($_) for @children;
        }
    }
}
This prints:
Problem Category = "Human Endeavors "
Problem Subcategory = "Space Exploration"
Problem Type = "Failure to Launch"
Software Version = "9.8.77.omni.3"
Problem Details = "Issue with signal barrier chamber."