How To Skip More Than One Line Of Header In An RDD In Spark

Data in my first RDD is like

    1253
    545553
    12344896
    1 2 1
    1 43 2
    1 46 1
    1 53 2

Now the first 3 integers are some counters that I need to broadcast. After that all the lines have th

Solution 1:

  1. Imports for Python 2

    from __future__ import print_function
    
  2. Prepare dummy data:

    s = "1253\n545553\n12344896\n1 2 1\n1 43 2\n1 46 1\n1 53 2"
    with open("file.txt", "w") as fw:
        fw.write(s)
    
  3. Read raw input:

    raw = sc.textFile("file.txt")
    
  4. Extract header:

    header = raw.take(3)
    print(header)
    ### [u'1253', u'545553', u'12344896']
    
  5. Filter lines:

    • using zipWithIndex

      content = raw.zipWithIndex().filter(lambda kv: kv[1] > 2).keys()
      print(content.first())
      ## 1 2 1
    • using mapPartitionsWithIndex

      from itertools import islice
      
      # Drop the first 3 lines of partition 0 only; other partitions pass through.
      content = raw.mapPartitionsWithIndex(
          lambda i, it: islice(it, 3, None) if i == 0 else it)
      
      print(content.first())
      ## 1 2 1
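
To see why the `mapPartitionsWithIndex` variant only touches partition 0, here is a plain-Python sketch that mimics the call on a list of lists standing in for partitions. The helper `map_partitions_with_index` and the partition layout are illustrative assumptions, not Spark APIs. The approach relies on all header lines landing in the first partition, which is worth checking for longer headers.

```python
from itertools import islice

def skip_header(index, lines, n=3):
    """Drop the first n lines, but only in partition 0."""
    return islice(lines, n, None) if index == 0 else lines

def map_partitions_with_index(partitions, func):
    """Hypothetical stand-in for RDD.mapPartitionsWithIndex on plain lists."""
    return [list(func(i, iter(part))) for i, part in enumerate(partitions)]

# Assumed layout: the file split into two partitions, header fully in the first.
partitions = [
    ["1253", "545553", "12344896", "1 2 1", "1 43 2"],
    ["1 46 1", "1 53 2"],
]

result = map_partitions_with_index(partitions, skip_header)
print(result)  # [['1 2 1', '1 43 2'], ['1 46 1', '1 53 2']]
```

With a real RDD, the same function should be usable directly as `raw.mapPartitionsWithIndex(skip_header)`.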

NOTE: All credit goes to pzecevic and Sean Owen (see linked sources).
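
The question mentions that the three header integers are counters that need to be broadcast. This is not part of the original answer, but a minimal sketch of that step could look as follows. The function names `parse_counters` and `broadcast_counters` are made up for illustration, and `sc` is assumed to be a live SparkContext.

```python
def parse_counters(header):
    """Convert the header lines (strings) into a list of ints."""
    return [int(line.strip()) for line in header]

def broadcast_counters(sc, header):
    """Broadcast the parsed counters; `sc` is assumed to be a live SparkContext."""
    return sc.broadcast(parse_counters(header))

header = ["1253", "545553", "12344896"]  # what raw.take(3) returns for the dummy file
counters = parse_counters(header)
print(counters)  # [1253, 545553, 12344896]
```

Inside a transformation, the values would then be read through the broadcast variable's `.value` attribute.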

Solution 2:

In my case I have a csv file like below

----- HEADER START -----
We love to generate headers
#who needs comment char?
----- HEADER END -----

colName1,colName2,...,colNameN
val__1.1,val__1.2,...,val__1.N

It took me a day to figure this out:

import spark.implicits._ // provides the Encoder[String] needed by createDataset

val rdd = spark.read.textFile(pathToFile).rdd
  .zipWithIndex() // get tuples (line, index)
  .filter({ case (line, index) => index >= numberOfLinesToSkip }) // index is 0-based
  .map({ case (line, index) => line }) // get rid of index
val ds = spark.createDataset(rdd) // convert rdd to dataset
val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds) // parse csv

Sorry, the code is in Scala, but it can easily be converted to Python.
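
A possible Python conversion is sketched below. The names `keep_line` and `read_csv_skipping_header` are hypothetical, and `spark` is assumed to be an existing SparkSession. The sketch keeps lines whose 0-based index is at least `lines_to_skip`, so exactly that many lines are dropped. Blank lines count too, since `textFile` returns them as empty strings.

```python
def keep_line(indexed_line, lines_to_skip):
    """True for (line, index) pairs past the header block; index is 0-based."""
    _, index = indexed_line
    return index >= lines_to_skip

def read_csv_skipping_header(spark, path, lines_to_skip):
    """Sketch only: `spark` is assumed to be an existing SparkSession."""
    lines = (
        spark.sparkContext.textFile(path)
        .zipWithIndex()                                   # (line, index) tuples
        .filter(lambda kv: keep_line(kv, lines_to_skip))  # drop the header block
        .keys()                                           # get rid of the index
    )
    # DataFrameReader.csv also accepts an RDD of CSV strings (Spark 2.2+).
    return spark.read.csv(lines, header=True, inferSchema=True)
```

The first line that survives the filter is treated as the column-name row because of `header=True`.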

Solution 3:

First, take the header values using the take() method, as zero323 suggested:

raw = sc.textFile("file.txt")
headers = raw.take(3)

Then filter out every line that matches one of them:

final_raw = raw.filter(lambda x: x not in headers)

and done.
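
One thing to keep in mind with this approach is that it filters by value rather than by position, so any data line whose text equals a header line is dropped as well. The plain-Python sketch below illustrates this with made-up data (the repeated `1253` line is hypothetical and not from the question). The same predicate could be passed to `raw.filter`.

```python
# Hypothetical input: the fifth line happens to repeat a header value.
lines = ["1253", "545553", "12344896", "1 2 1", "1253", "1 43 2"]
headers = lines[:3]  # stands in for raw.take(3)

def not_header(line):
    """Membership test: the line (a str) is compared with each header string."""
    return line not in headers

kept = [line for line in lines if not_header(line)]
print(kept)  # ['1 2 1', '1 43 2']
```

If duplicates like this can occur in the data, the index-based variants from Solution 1 are the safer choice.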
