How To Skip More Than One Line Of Header In RDD In Spark

April 21, 2024

Data in my first RDD is like

1253
545553
12344896
1 2 1
1 43 2
1 46 1
1 53 2

Now the first 3 integers are some counters that I need to broadcast. After that all the lines have th

Solution 1:

Imports for Python 2

from __future__ import print_function

Prepare dummy data:

s = "1253\n545553\n12344896\n1 2 1\n1 43 2\n1 46 1\n1 53 2"
with open("file.txt", "w") as fw:
    fw.write(s)

Read raw input:
```
raw = sc.textFile("file.txt")
```

Extract header:

header = raw.take(3)
print(header)
### [u'1253', u'545553', u'12344896']
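
The question asks for the three counters to be broadcast, so here is a minimal sketch of that step. The helper names `parse_counters` and `broadcast_counters` are made up for illustration, and the sketch assumes each header line holds exactly one integer, as in the dummy data.

```python
def parse_counters(header_lines):
    """Convert the header lines returned by raw.take(3) into integers."""
    return [int(line.strip()) for line in header_lines]


def broadcast_counters(sc, header_lines):
    """Ship the parsed counters to the executors as a read-only broadcast variable.

    `sc` is assumed to be an existing SparkContext.
    """
    return sc.broadcast(parse_counters(header_lines))


# Plain-Python check with the header from the sample file
print(parse_counters([u"1253", u"545553", u"12344896"]))
# [1253, 545553, 12344896]
```

With a live context this would be used as `counters = broadcast_counters(sc, header)`, and tasks would then read the values through `counters.value`.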

Filter lines:

using zipWithIndex

content = raw.zipWithIndex().filter(lambda kv: kv[1] > 2).keys()
print(content.first())
## 1 2 1
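
To see what the index filter does without starting Spark, the same logic can be emulated on a plain list. This is only a local stand-in, not Spark code: `skip_first_lines` is a hypothetical helper, and `enumerate` plays the role of `zipWithIndex`, which numbers elements from 0. Keeping indices `>= 3` is the same test as `> 2` above.

```python
lines = ["1253", "545553", "12344896", "1 2 1", "1 43 2", "1 46 1", "1 53 2"]


def skip_first_lines(lines, n):
    """Local stand-in for rdd.zipWithIndex().filter(lambda kv: kv[1] >= n).keys()."""
    # Build (line, index) pairs in the same order zipWithIndex produces them
    indexed = [(line, index) for index, line in enumerate(lines)]
    # Keep only the lines whose zero-based index is at least n
    return [line for line, index in indexed if index >= n]


content = skip_first_lines(lines, 3)
print(content[0])
# 1 2 1
```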

using mapPartitionsWithIndex

from itertools import islice

content = raw.mapPartitionsWithIndex(
    lambda i, iter: islice(iter, 3, None) if i == 0 else iter)

print(content.first())
## 1 2 1
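
The per-partition logic can also be written as a named function and checked locally. In this sketch `drop_header` is a hypothetical name and the two-way split of the sample file is made up. The approach assumes that all header lines sit in partition 0, which should hold for a few short lines at the start of a single file.

```python
from itertools import islice


def drop_header(partition_index, iterator, n=3):
    """Skip the first n records, but only in partition 0."""
    return islice(iterator, n, None) if partition_index == 0 else iterator


# Hypothetical split of the sample file into two partitions
partitions = [
    ["1253", "545553", "12344896", "1 2 1"],
    ["1 43 2", "1 46 1", "1 53 2"],
]

# Apply the function to every partition, as mapPartitionsWithIndex would
content = [line
           for i, part in enumerate(partitions)
           for line in drop_header(i, iter(part))]
print(content)
# ['1 2 1', '1 43 2', '1 46 1', '1 53 2']
```

On a real RDD the same function can be passed directly: `raw.mapPartitionsWithIndex(drop_header)`.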

NOTE: All credit goes to pzecevic and Sean Owen (see linked sources).

Solution 2:

In my case I have a CSV file like the one below:

----- HEADER START -----
We love to generate headers
#who needs comment char?
----- HEADER END -----

colName1,colName2,...,colNameN
val__1.1,val__1.2,...,val__1.N

Took me a day to figure out

import spark.implicits._ // provides the Encoder[String] needed by createDataset

val rdd = spark.read.textFile(pathToFile).rdd
  .zipWithIndex() // get tuples (line, index); index starts at 0
  .filter({ case (line, index) => index >= numberOfLinesToSkip }) // drop the first numberOfLinesToSkip lines
  .map({ case (line, index) => line }) // get rid of index
val ds = spark.createDataset(rdd) // convert RDD to Dataset[String]
val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds) // parse CSV

Sorry, the code is in Scala; however, it can easily be converted to Python.
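
A possible PySpark version is sketched below. The function names and the `lines_to_skip` parameter are made up for illustration, and `spark` is assumed to be an existing SparkSession. It keeps the lines whose zero-based index is at least `lines_to_skip` and relies on `DataFrameReader.csv` accepting an RDD of strings, which should be the case from Spark 2.2 on.

```python
def is_data_line(pair, lines_to_skip):
    """pair is (line, zero-based index), as produced by zipWithIndex."""
    _line, index = pair
    return index >= lines_to_skip


def read_csv_skipping_header(spark, path, lines_to_skip):
    """Drop the first lines_to_skip lines, then parse the rest as CSV."""
    lines = (spark.sparkContext.textFile(path)
             .zipWithIndex()                                          # (line, index)
             .filter(lambda pair: is_data_line(pair, lines_to_skip))  # skip the banner
             .keys())                                                 # back to plain lines
    # The first remaining line is used as the column header
    return spark.read.csv(lines, header=True, inferSchema=True)
```

For the sample file, `lines_to_skip` would be 5 if the blank line after the banner is really part of the file (four banner lines plus the blank one), so that the `colName1,...` row becomes the header.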

Solution 3:

First take the values using take() method as zero323 suggested

raw = sc.textFile("file.txt")
headers = raw.take(3)

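Then

final_raw = raw.filter(lambda x: x not in headers)

and done. This removes every line equal to one of the header values, so it is only safe when no data line can match a header line.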
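
Each element of the RDD is a string while `headers` is a list, so the filter needs a membership test (`not in`) rather than a comparison with the whole list. The sketch below tries this on plain Python lists instead of an RDD. The extra `"1253"` data line is made up to show the side effect of filtering by value, and `is_not_header` is a hypothetical helper name.

```python
headers = ["1253", "545553", "12344896"]
# The last line is a made-up data line that happens to equal a header value
lines = headers + ["1 2 1", "1 43 2", "1253"]

# A frozenset gives constant-time membership tests
header_set = frozenset(headers)


def is_not_header(line):
    """True for lines that do not match any header value."""
    return line not in header_set


kept = [line for line in lines if is_not_header(line)]
print(kept)
# ['1 2 1', '1 43 2']
```

With Spark the same predicate would be passed as `raw.filter(is_not_header)`. Note that the made-up `"1253"` data line is dropped as well, which the index-based approaches avoid.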
