Filling Missing Values Using Numpy.genfromtxt
Solution 1:
Using pandas:
import pandas as pd
df = pd.read_table('data', sep=r'\s+', header=None)
df.fillna(0, inplace=True)
print(df)
#    0  1    2
# 0  1  2  3.0
# 1  4  5  6.0
# 2  7  8  0.0
pandas.read_table replaces missing data with NaNs. You can replace those NaNs with some other value using df.fillna.
df is a pandas.DataFrame. You can access the underlying NumPy array with df.values:
print(df.values)
# [[ 1. 2. 3.]
# [ 4. 5. 6.]
# [ 7. 8. 0.]]
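For anyone who wants to try this without creating a file first, here is a minimal sketch of the same steps. It assumes the file holds the three rows shown in the output above and uses io.StringIO as a stand-in for it; the names raw, filled and arr are only illustrative. It also uses DataFrame.to_numpy(), which returns the same data as df.values.

```python
import io

import pandas as pd

# In-memory stand-in for the data file; the last row is one value short.
raw = "1 2 3\n4 5 6\n7 8\n"

# Whitespace-separated columns, no header row; the short row is padded with NaN.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

# fillna without inplace=True returns a new DataFrame and leaves df untouched.
filled = df.fillna(0)

# Convert to a plain NumPy array (equivalent to filled.values).
arr = filled.to_numpy()
print(arr)
```

Because one column contained NaN, the resulting array has a float dtype even though the other columns were read as integers.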
Solution 2:
The issue is that NumPy doesn't like ragged arrays. Since there is no character in the third position of the last row of the file, genfromtxt doesn't even know there is something to parse, let alone what to do with it. If the missing value had a filler (any filler), such as:
1 2 3
4 5 6
7 8 ''
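then you'd be able to run:
import numpy as np
sol = np.genfromtxt("a.txt",
                    dtype=float,
                    invalid_raise=False,
                    missing_values='',
                    usemask=False,
                    filling_values=0.0)
and get a full 3x3 array (note that in this output the filler is read as nan rather than as the fill value):
sol
array([[  1.,   2.,   3.],
       [  4.,   5.,   6.],
       [  7.,   8.,  nan]])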
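Since the result above still contains nan in the last position, one way to get the intended fill value is to replace the nan entries after loading. This is a minimal sketch that starts from the array shown above instead of re-reading the file; fill_value and filled are illustrative names, and 0.0 is assumed as the fill value to match filling_values.

```python
import numpy as np

# Array as shown in the output above, with nan where the filler was.
sol = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0],
                [7.0, 8.0, np.nan]])

fill_value = 0.0  # assumed fill value, matching filling_values above

# Keep every finite entry and substitute fill_value wherever sol is nan.
filled = np.where(np.isnan(sol), fill_value, sol)
print(filled)
```

np.where builds a new array, so sol itself keeps its nan entry in case you still need to know which values were missing.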
Unfortunately, if making the columns of the file uniform isn't an option, you might be stuck with line-by-line parsing.
Another possibility applies only if all the "short" rows are at the end of the file. In that case you can use the usecols argument to parse the columns that are present in every row, then parse the remaining columns separately, using skip_footer to skip the rows where they are missing:
sol = np.genfromtxt("a.txt",
                    dtype=float,
                    invalid_raise=False,
                    usemask=False,
                    filling_values=0.0,
                    usecols=(0, 1))
sol
array([[ 1.,  2.],
       [ 4.,  5.],
       [ 7.,  8.]])
sol2 = np.genfromtxt("a.txt",
                     dtype=float,
                     invalid_raise=False,
                     usemask=False,
                     filling_values=0.0,
                     usecols=(2,),
                     skip_footer=1)
sol2
array([ 3.,  6.])
Then combine the two arrays, appending the fill value for the missing entry:
sol2 = np.append(sol2, 0.0)
sol2 = sol2.reshape(3, 1)
sol = np.hstack([sol, sol2])
sol
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  0.]])
Solution 3:
In my experience the best approach is to just parse the file manually. This function works for me; it might be slow, but it is generally fast enough.
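import numpy as np

def manual_parsing(filename, delim, dtype):
    out = list()
    lengths = list()
    with open(filename, 'r') as ins:
        for line in ins:
            # Drop the trailing newline so it cannot end up in the last field
            l = line.rstrip("\n").split(delim)
            out.append(l)
            lengths.append(len(l))
    # Pad every short row with "nan" up to the length of the longest row
    # (this needs a float dtype, since "nan" cannot be converted to int)
    lim = np.max(lengths)
    for l in out:
        while len(l) < lim:
            l.append("nan")
    return np.array(out, dtype=dtype)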
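The same padding idea can be written more compactly with itertools.zip_longest. The sketch below is a variant, not the function above: pad_ragged_rows and its parameters are hypothetical names, and it takes an iterable of lines (an open file works too) so it can be tried without a file on disk.

```python
from itertools import zip_longest

import numpy as np


def pad_ragged_rows(lines, delim=None, fill="nan", dtype=float):
    """Split each line, pad short rows with `fill`, and return a 2-D array."""
    # Skip blank lines; strip the newline before splitting on delim
    # (delim=None splits on any run of whitespace).
    rows = [line.strip().split(delim) for line in lines if line.strip()]
    # zip_longest groups the fields by column and pads the short rows with fill.
    columns = zip_longest(*rows, fillvalue=fill)
    # The array is built column-wise, so transpose it back to the row layout.
    return np.array(list(columns), dtype=dtype).T


lines = ["1 2 3\n", "4 5 6\n", "7 8\n"]
result = pad_ragged_rows(lines)
print(result)
```

As with the loop version, the default "nan" filler only works with a float dtype; pass a different fill string if you need integers.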