@twein89 2016-06-22T22:19:04.000000Z Words 2849 Views 448

spark cs105_lab0

spark python


Part 1: Test Spark functionality

(1a) Create a DataFrame and filter it
    # Check that Spark is working
    from pyspark.sql import Row
    data = [('Alice', 1), ('Bob', 2), ('Bill', 4)]
    df = sqlContext.createDataFrame(data, ['name', 'age'])
    fil = df.filter(df.age > 3).collect()
    print(fil)
    # If the Spark job doesn't work properly this will raise an AssertionError
    assert fil == [Row(u'Bill', 4)]

Output:

    [Row(name=u'Bill', age=4)]
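
To see what the filter is doing without a cluster, here is a minimal plain-Python sketch of the same check. It is only an analogue, not Spark code: `Person` is a hypothetical `namedtuple` standing in for Spark's `Row`, and the list comprehension plays the role of `df.filter(df.age > 3).collect()`.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row; not part of the lab.
Person = namedtuple('Person', ['name', 'age'])

data = [('Alice', 1), ('Bob', 2), ('Bill', 4)]
people = [Person(*record) for record in data]

# Keep only the records whose age is greater than 3.
older_than_three = [p for p in people if p.age > 3]
print(older_than_three)

# A namedtuple compares equal to a plain tuple holding the same values,
# so the expected result can be written without field names.
assert older_than_three == [('Bill', 4)]
```

Spark's `Row` is also a tuple subclass, which is why the lab can compare the collected result against a `Row` built from positional values only.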
(1b) Loading a text file

Let's load a text file.

    # Check loading data with sqlContext.read.text
    import os.path
    baseDir = os.path.join('databricks-datasets', 'cs100')
    inputPath = os.path.join('lab1', 'data-001', 'shakespeare.txt')
    fileName = os.path.join(baseDir, inputPath)
    dataDF = sqlContext.read.text(fileName)
    shakespeareCount = dataDF.count()
    print(shakespeareCount)
    # If the text file didn't load properly an AssertionError will be raised
    assert shakespeareCount == 122395

Output:

    122395
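
Outside Databricks the dataset path above will not exist. As a rough local analogue, assuming only that a line-oriented reader yields one record per line of the file, the same count-and-assert pattern can be sketched in plain Python. The temporary file and its three lines are made up for illustration.

```python
import os
import tempfile

# Made-up sample text; the real lab reads shakespeare.txt from databricks-datasets.
sample_lines = ['To be, or not to be,', 'that is the question:', '']

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as handle:
    handle.write('\n'.join(sample_lines) + '\n')
    sample_path = handle.name

# Count one record per line, including the empty line.
with open(sample_path) as handle:
    line_count = sum(1 for _ in handle)

os.remove(sample_path)
print(line_count)

# Same idea as the lab: fail loudly if the file did not load as expected.
assert line_count == len(sample_lines)
```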

Part 2: Check class testing library

(2a) Compare with hash
    # TEST Compare with hash (2a)
    # Check our testing library/package
    # This should print '1 test passed.' on two lines
    from databricks_test_helper import Test
    twelve = 12
    Test.assertEquals(twelve, 12, 'twelve should equal 12')
    Test.assertEqualsHashed(twelve, '7b52009b64fd0a2a49e6d8a939753077792b0554',
                            'twelve, once hashed, should equal the hashed value of 12')
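
The helper's source is not shown here, so the following is only a sketch of how a hashed comparison could work. It assumes the value is converted to a string and hashed with SHA-1 (the 40-character hex digest above is consistent with SHA-1); `hash_value` and `equals_hashed` are hypothetical names, not the `databricks_test_helper` API.

```python
import hashlib

def hash_value(value):
    """Return the SHA-1 hex digest of str(value). Hypothetical helper."""
    return hashlib.sha1(str(value).encode('utf-8')).hexdigest()

def equals_hashed(value, expected_digest):
    """Check a value against a known digest without storing the value itself."""
    return hash_value(value) == expected_digest

twelve = 12
digest = hash_value(twelve)
print(digest)

# The digest depends only on the string form, so 12 and '12' hash identically.
assert equals_hashed('12', digest)
# A different value produces a different digest.
assert not equals_hashed(13, digest)
```

The appeal of this design is that a test can ship the expected digest without revealing the expected answer.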
(2b) Compare lists
    # TEST Compare lists (2b)
    # This should print '1 test passed.'
    unsortedList = [(5, 'b'), (5, 'a'), (4, 'c'), (3, 'a')]
    Test.assertEquals(sorted(unsortedList), [(3, 'a'), (4, 'c'), (5, 'a'), (5, 'b')],
                      'unsortedList does not sort properly')
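
The expected list follows from how Python orders tuples: element by element, so ties on the first item are broken by the second. A short check of that rule, using the same data but illustrative variable names:

```python
unsorted_pairs = [(5, 'b'), (5, 'a'), (4, 'c'), (3, 'a')]

# Default sort compares the first element, then the second on ties.
default_order = sorted(unsorted_pairs)

# Spelling the same ordering out with an explicit key gives an identical result.
explicit_order = sorted(unsorted_pairs, key=lambda pair: (pair[0], pair[1]))

print(default_order)
assert default_order == explicit_order
assert default_order == [(3, 'a'), (4, 'c'), (5, 'a'), (5, 'b')]
```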

Part 3: Check plotting

(3a) Our first plot
    # Check matplotlib plotting
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    from math import log

    # Function for generating the plot layout
    def preparePlot(xticks, yticks, figsize=(10.5, 6), hideLabels=False,
                    gridColor='#999999', gridWidth=1.0):
        plt.close()
        fig, ax = plt.subplots(figsize=figsize, facecolor='white', edgecolor='white')
        ax.axes.tick_params(labelcolor='#999999', labelsize='10')
        for axis, ticks in [(ax.get_xaxis(), xticks), (ax.get_yaxis(), yticks)]:
            axis.set_ticks_position('none')
            axis.set_ticks(ticks)
            axis.label.set_color('#999999')
            if hideLabels:
                axis.set_ticklabels([])
        plt.grid(color=gridColor, linewidth=gridWidth, linestyle='-')
        # Hide all four spines; an explicit loop also works on Python 3, where map() is lazy
        for position in ['bottom', 'top', 'left', 'right']:
            ax.spines[position].set_visible(False)
        return fig, ax

    # Generate layout and plot data
    x = range(1, 50)
    y = [log(x1 ** 2) for x1 in x]
    fig, ax = preparePlot(range(5, 60, 10), range(0, 12, 1))
    plt.scatter(x, y, s=14**2, c='#d6ebf2', edgecolors='#8cbfd0', alpha=0.75)
    ax.set_xlabel(r'$range(1, 50)$')
    ax.set_ylabel(r'$\log_e(x^2)$')
    display(fig)  # display() is provided by the Databricks notebook environment
    pass
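
The tick ranges passed to `preparePlot` can be sanity-checked without drawing anything. This sketch uses only the `math` module and relies on the identity log(x²) = 2·log(x) for x > 0; the variable names are illustrative.

```python
from math import log, isclose

xs = list(range(1, 50))            # 1 through 49, as in the plot
ys = [log(x ** 2) for x in xs]

# log(x**2) equals 2*log(x) for positive x.
assert all(isclose(y, 2 * log(x)) for x, y in zip(xs, ys))

# The y-axis ticks come from range(0, 12, 1), i.e. 0 through 11,
# so every plotted value should fall inside that interval.
print(min(ys), max(ys))
assert 0 <= min(ys) and max(ys) <= 11
```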

Part 4: Check MathJax formulas

(4a) Gradient descent formula

You should see a formula on the line below this one:

This formula is included inline with the text and is .
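
The rendered formulas are absent from this text. As a hedged reconstruction, assuming the section refers to the usual gradient descent update for least-squares regression (weights $\mathbf{w}$, step size $\alpha_i$, examples $\mathbf{x}_j$ with labels $y_j$; all of these symbols are assumptions), the display formula and the inline formula would read:

```latex
% Display formula: one gradient descent step for least squares (assumed form)
\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_j \left(\mathbf{w}_i^\top \mathbf{x}_j - y_j\right) \mathbf{x}_j

% Inline formula: the per-example term of the sum above (assumed form)
\left(\mathbf{w}^\top \mathbf{x} - y\right) \mathbf{x}
```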

(4b) Log loss formula

This formula shows the log loss for a single point. Log loss is defined as:
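
Assuming the standard definition of log loss for a predicted probability $p$ and a binary label $y \in \{0, 1\}$ (the symbols are chosen here, since the rendered formula is absent from this text), it can be written as:

```latex
\ell_{\log}(p, y) =
\begin{cases}
  -\log(p)     & \text{if } y = 1 \\
  -\log(1 - p) & \text{if } y = 0
\end{cases}
```

For example, a prediction of p = 0.9 costs -log(0.9) ≈ 0.105 when y = 1, but -log(0.1) ≈ 2.303 when y = 0, so confident wrong predictions are penalized heavily.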
