At 好例子网: share, exchange, and grow!

Data Algorithms: Hadoop/Spark Big Data Processing Techniques

Java Language Basics

Download this example
  • Development language: Java
  • Example size: 36.93 MB
  • Downloads: 6
  • Views: 15
  • Published: 2022-09-21
  • Category: Java Language Basics
  • Publisher: wangjinghong
  • File format: .pdf
  • Points required: 2
 Related tags: Hadoop, Spark, data processing, big data, ADO

Example Introduction

[Overview] Data Algorithms: Hadoop/Spark Big Data Processing Techniques


[Table of Contents] (an illustrative Spark RDD sketch follows the listing)


0.1 Introduction
0.2 Relationship of Spark and Hadoop
0.3 What is MapReduce?
0.4 Why use MapReduce?
0.5 What Is in This Book?
0.6 What Is the Focus of This Book?
0.7 What are Core Concepts of MapReduce/Hadoop?
0.8 Is MapReduce for Everything?
0.9 What is not MapReduce
0.10 Who Is This Book For?
0.11 What Software Is Used in This Book?
0.12 Using Code Examples
0.13 Where NOT to use MapReduce?
0.14 Chapters in This Book?
0.15 Online Resources
0.16 Comments and Questions for This Book?
1 Secondary Sort: Introduction
1.1 What is a Secondary Sort Problem?
1.2 Solutions to Secondary Sort Problem
1.2.1 Sort Order of Intermediate Keys
1.3 Data Flow Using Plug-in Classes
1.4 MapReduce/Hadoop Solution
1.4.1 Input
1.4.2 Expected Output
1.4.3 map() function
1.4.4 reduce() function
1.4.5 Hadoop Implementation
1.4.6 Sample Run of Hadoop Implementation
1.4.7 Sample Run
1.5 What If Sorting Ascending or Descending
1.6 Spark Solution To Secondary Sorting
1.6.1 Time-Series as Input
1.6.2 Expected Output
1.6.3 Option-1: Secondary Sorting in Memory
1.6.4 Spark Sample Run
1.6.5 Option-2: Secondary Sorting using Framework
2 Secondary Sorting: Detailed Example
2.1 Introduction
2.2 Secondary Sorting Technique
2.3 Complete Example of Secondary Sorting
2.3.1 Problem Statement
2.3.2 Input Format
2.3.3 Output Format
2.3.4 Composite Key
2.3.5 Sample Run
2.4 Secondary Sort using New Hadoop API
3 Top 10 List
3.1 Introduction
3.2 Top-N Formalized
3.3 MapReduce Solution
3.4 Implementation in Hadoop
3.4.1 Input
3.4.2 Sample Run 1: find top 10 list
3.4.3 Output
3.4.4 Sample Run 2: find top 5 list
3.5 Bottom 10
3.6 Spark Implementation: Unique Keys
3.6.1 Introduction
3.6.2 What is an RDD?
3.6.3 Spark's Function Classes
3.6.4 Spark Solution for Top-10 Pattern
3.6.5 Complete Spark Solution for Top-10 Pattern
3.6.6 Input
3.6.7 Sample Run: find top-10 list
3.7 What If for Top-N
3.7.1 Shared Data Structures Definition and Usage
3.8 What If for Bottom-N
3.9 Spark Implementation: Non-Unique Keys
3.9.1 Complete Spark Solution for Top-10 Pattern
4 Left Outer Join in MapReduce
4.1 Introduction
4.2 Implementation of Left Outer Join in MapReduce
4.2.1 MapReduce Phase-1
4.2.2 MapReduce Phase-2: Counting Unique Locations
4.2.3 Implementation Classes in Hadoop
4.3 Sample Run
4.3.1 Input for Phase-1
4.3.2 Run Phase-1
4.3.3 View Output of Phase-1 (Input of Phase-2)
4.3.4 Run Phase-2
4.3.5 View Output of Phase-2
4.4 Spark Implementation
4.4.1 Spark Program
4.4.2 STEP-0: Import Required Classes
4.4.3 STEP-1: Read Input Parameters
4.4.4 STEP-2: Create JavaSparkContext Object
4.4.5 STEP-3: Create a JavaPairRDD for Users
4.4.6 STEP-4: Create a JavaPairRDD for Transactions
4.4.7 STEP-5: Create a union of RDD's created by STEP-3 and STEP-4
4.4.8 STEP-6: Create a JavaPairRDD(userID, List(T2)) by calling groupBy()
4.4.9 STEP-7: Create a productLocationsRDD as JavaPairRDD(String,String)
4.4.10 STEP-8: Find all locations for a product
4.4.11 STEP-9: Finalize output by changing "value"
4.4.12 STEP-10: Print the final result RDD
4.4.13 Running Spark Solution
4.5 Running Spark on YARN
4.5.1 Script to Run Spark on YARN
4.5.2 Running Script
4.5.3 Checking Expected Output
4.6 Left Outer Join by Spark's leftOuterJoin()
4.6.1 High-Level Steps
4.6.2 STEP-0: import required classes and interfaces
4.6.3 STEP-1: read input parameters
4.6.4 STEP-2: create Spark's context object
4.6.5 STEP-3: create RDD for user's data
4.6.6 STEP-4: Create usersRDD: The "right" Table
4.6.7 STEP-5: create transactionRDD for transaction's data
4.6.8 STEP-6: Create transactionsRDD: The Left Table
4.6.9 STEP-7: use Spark's built-in JavaPairRDD.leftOuterJoin() method
4.6.10 STEP-8: create (product, location) pairs
4.6.11 STEP-9: group (K=product, V=location) pairs by K
4.6.12 STEP-10: create final output (K=product, V=Set(location))
4.6.13 Sample Run by YARN
5 Order Inversion Pattern
5.1 Introduction
5.2 Example of Order Inversion Pattern
5.3 MapReduce for Order Inversion Pattern
5.3.1 Custom Partitioner
5.3.2 Relative Frequency Mapper
5.3.3 Relative Frequency Reducer
5.3.4 Implementation Classes in Hadoop
5.4 Sample Run
5.4.1 Input
5.4.2 Running MapReduce Job
5.4.3 Generated Output
6 Moving Average
6.1 Introduction
6.1.1 Example-1: Time Series Data
6.1.2 Example-2: Time Series Data
6.2 Formal Definition
6.3 Moving Average by POJO
6.3.1 First Solution: using Queue
6.3.2 Second Solution: using Array
6.3.3 Testing of Moving Average
6.3.4 Sample Run
6.4 MapReduce Solution
6.4.1 Input
6.4.2 Output
6.4.3 MapReduce Solution: Option-1: sort in RAM
6.4.4 Hadoop Implementation: sort in RAM
6.4.5 Sample Run
6.4.6 MapReduce Solution: Option-2: Sort by MR Framework
6.5 Sample Run
7 Market Basket Analysis
7.1 What is Market Basket Analysis?
7.2 MapReduce/Hadoop Solution
7.3 What are the Application areas for MBA?
7.4 Market Basket Analysis using MapReduce
7.4.1 Mapper Formal
7.4.2 Reducer
7.5 MapReduce/Hadoop Implementation Classes
7.5.1 Find Sorted Combinations
7.5.2 Market Basket Analysis Driver: MBADriver
7.5.3 Market Basket Analysis Mapper: MBAMapper
7.5.4 Sample Run
7.6 Spark/Hadoop Solution
7.6.1 MapReduce Algorithm
7.6.2 Input
7.6.3 Spark Implementation
7.6.4 Creating Item Sets From Transactions
8 Common Friends
8.1 Introduction
8.2 Input
8.3 Common Friends Algorithm
8.4 MapReduce Algorithm
8.4.1 MapReduce Algorithm in Action
8.5 Solution 1: Hadoop Implementation using Text
8.5.1 Sample Run for Solution 1
8.6 Solution 2: Hadoop Implementation using ArrayListOfLongsWritable
8.6.1 Sample Run for Solution 2
8.7 Spark Solution
8.7.1 STEP-0: Import Required Classes
8.7.2 STEP-1: Check Input Parameters
8.7.3 STEP-2: Create a JavaSparkContext Object
8.7.4 STEP-3: Read Input
8.7.5 STEP-4: Apply a Mapper
8.7.6 STEP-5: Apply a Reducer
8.7.7 STEP-6: Find Common Friends
8.8 Sample Run of a Spark Program
8.8.1 HDFS Input
8.8.2 Script to Run Spark Program
8.8.3 Log of Sample Run
9 Recommendation Engines using MapReduce
9.1 Customers Who Bought This Item Also Bought
9.1.1 Input
9.1.2 Expected Output
9.1.3 MapReduce Solution
9.2 Frequently Bought Together
9.2.1 Input
9.2.2 MapReduce Solution
9.3 Recommend People Connection
9.3.1 Input
9.3.2 Output
9.3.3 MapReduce Solution
9.4 Spark Implementation
9.4.1 STEP-0: Import Required Classes
9.4.2 STEP-1: Handle Input Parameters
9.4.3 STEP-2: Create Spark's Context Object
9.4.4 STEP-3: Read HDFS Input File
9.4.5 STEP-4: Implement map() Function
9.4.6 STEP-5: Implement reduce() Function
9.4.7 STEP-6: Generate Final Output
9.4.8 Convenient Methods
9.4.9 HDFS Input
9.4.10 Script to Run Spark Program
9.4.11 Program Run Log
10 Content-Based Recommendation: Movies
10.1 Input
10.2 MapReduce PHASE-1
10.3 MapReduce PHASE-2 and PHASE-3
10.4 MapReduce-Phase-2 Mapper
10.5 MapReduce-Phase-2 Reducer
10.6 MapReduce-Phase-3 Mapper
10.7 MapReduce-Phase-3 Reducer
10.8 More Similarity Measures
10.9 Movie Recommendation in Spark
10.9.1 High-Level Solution in Spark
10.9.2 High-Level Solution: All Steps
10.9.3 STEP-0: Import Required Classes
10.9.4 STEP-1: Handle Input Parameters
10.9.5 STEP-2: Create a Spark's Context Object
10.9.6 STEP-3: Read Input File and Create RDD
10.9.7 STEP-4: Find Who Has Rated Movies
10.9.8 STEP-5: Group moviesRDD by Movie
10.9.9 STEP-6: Find Number of Raters per Movie
10.9.10 STEP-7: Perform Self-Join
10.9.11 STEP-8: Remove Duplicate (movie1, movie2) Pairs
10.9.12 STEP-9: Generate All (movie1, movie2) Combinations
10.9.13 STEP-10: Group Movie Pairs
10.9.14 STEP-11: Calculate Correlations
10.9.15 STEP-12: Print Final Results
10.9.16 Helper Method: calculateCorrelations()
10.9.17 Helper Method: calculatePearsonCorrelation()
10.9.18 Helper Method: calculateCosineCorrelation()
10.9.19 Helper Method: calculateJaccardCorrelation()
10.10 Sample Run of Spark Program
10.10.1 HDFS Input
10.10.2 Script
10.11 Log of Sample Run
10.11.1 Inspecting HDFS Output
11 Smarter Email Marketing with Markov Model
11.1 Introduction
11.2 Markov Chain in a Nutshell
11.3 Markov Model using MapReduce
11.3.1 MapReduce to Generate Time-ordered Transactions
11.3.2 MapReduce to Generate Markov State Transition
11.4 Using Markov Model to Predict Next Email Marketing Date
12 K-Means Clustering
12.1 Introduction
12.2 What is K-Means Clustering
12.3 What are the Applications of Clustering?
12.4 K-Means Clustering Method: Partitioning Approach
12.5 K-Means Distance Function
12.6 K-Means Clustering Step-by-Step Example
12.7 K-Means Clustering Formalized
12.8 MapReduce Solution for K-Means Clustering
12.8.1 MapReduce Solution: map()
12.8.2 MapReduce Solution: combine()
12.8.3 MapReduce Solution: reduce()
12.9 MapReduce K-Means Clustering Step-by-Step Example
12.10 K-Means Implementation by Spark
12.10.1 Sample Run of K-Means by Spark
13 kNN: k-Nearest-Neighbors
13.1 Introduction
13.2 kNN Classification
13.3 Distance Functions
13.4 kNN Example
13.5 An Informal kNN Algorithm
13.6 Formal kNN Algorithm
13.6.1 Java-like Non-MapReduce Solution for kNN
13.7 kNN Implementation in Spark
13.7.1 Formalizing kNN for Spark Implementation
13.7.2 Input Data Set Formats
13.7.3 kNN Implementation in Spark
14 Naive Bayes
14.1 Introduction
14.2 Training and Learning Stage
14.2.1 Example: Training Data (Numeric Data)
14.2.2 Example: Training Data (Symbolic Data)
14.3 Conditional Probability
14.4 The Naive Bayes Classifier
14.4.1 The Naive Bayes Classifier Example
14.5 The Naive Bayes Classifier: MapReduce Solution for Symbolic Data
14.5.1 STAGE-1: Building Classifier Using Symbolic Training Data
14.5.2 STAGE-2: Using Classifier To Classify New Symbolic Data
14.6 The Naive Bayes Classifier: MapReduce Solution for Numeric Data
14.7 Naive Bayes Classifier Implementation in Spark
14.7.1 STAGE-1: Building Classifier Using Training Data
14.7.2 STAGE-2: Using Classifier To Classify New Data
14.8 Using Apache Mahout
15 Sentiment Analysis
15.1 Introduction
15.1.1 Sentiment Examples
15.1.2 Sentiment Scores: Positive or Negative
15.2 Steps for Sentiment Analysis
15.3 A Simple MapReduce for Sentiment Analysis
15.3.1 map() for Sentiment Analysis
15.3.2 reduce() for Sentiment Analysis
16 Finding, Counting and Listing all Triangles in Large Graphs
16.1 Introduction
16.2 Basic Graph Concepts
16.3 Importance of Counting Triangles
16.4 MapReduce Solution
16.5 MapReduce in Action
16.6 STEP-3: Remove Duplicate Triangles
16.6.1 STEP-3: Mapper
16.6.2 STEP-3: Reducer
16.7 Hadoop Implementation
16.7.1 Sample Run
16.8 Spark Implementation
16.8.1 STEP-0: Import Required Classes and Interfaces
16.8.2 STEP-1: Read Input Parameters
16.8.3 STEP-2: Create a Spark Context Object
16.8.4 STEP-3: Read Graph via HDFS Input
16.8.5 STEP-4: Create All Graph Edges
16.8.6 STEP-5: Create RDD To Generate Triads
16.8.7 STEP-6: Create All Possible Triads
16.8.8 STEP-7: Create RDD To Generate Triangles
16.8.9 STEP-8: Create All Triangles
16.8.10 STEP-9: Create Unique Triangles
16.9 Spark Sample Run
16.9.1 Input
16.9.2 Script
16.9.3 Running Script
17 K-mer Counting
17.1 Introduction to K-mers
17.2 K-mer counting using MapReduce
17.2.1 K-mer counting using MapReduce: map()
17.2.2 K-mer counting using MapReduce: reduce()
17.2.3 K-mer Counting with MapReduce and Hadoop
17.3 Input Data for K-mer Counting
17.3.1 Sample Runs of K-mer Counting
17.4 K-mer Implementation in Spark
17.4.1 K-mer High-Level Solution in Spark
17.4.2 STEP-0: import required classes and interfaces
17.4.3 createJavaSparkContext()
17.4.4 STEP-1: handle input parameters
17.4.5 STEP-2: create a Spark context object
17.4.6 STEP-3: broadcast global shared objects
17.4.7 STEP-4: read FASTQ file from HDFS and create the first RDD
17.4.8 STEP-5: filter redundant records
17.4.9 STEP-6: generate K-mers
17.4.10 STEP-7: combine/reduce frequent K-mers
17.4.11 STEP-8: create a local top-N
17.4.12 STEP-9: Find Final top-N
17.4.13 STEP-10: Emit Final top-N
17.4.14 YARN Script for Spark
17.4.15 HDFS Input
17.4.16 Output for Final Top-N
18 DNA-Sequencing
18.1 Introduction
18.2 Input to DNA-Sequencing
18.3 Input data validation
18.4 DNA-Sequencing: Alignment
18.5 MapReduce Algorithms for DNA-Sequencing
18.6 MR Algorithms: Step-1: DNA-Sequencing: Alignment
18.6.1 Step-1: map(): Alignment
18.6.2 Step-1: reduce(): Alignment
18.7 Step-2: DNA-Sequencing: Recalibration
18.8 Step-3: DNA-Sequencing: Variant Detection
18.8.1 Variant Detection Mapper
18.8.2 Variant Detection Reducer
19 Cox Regression
19.1 Introduction to Survival Analysis using Cox Regression
19.2 Cox Model in a Nutshell
19.3 MapReduce Solution for Cox Regression
19.3.1 Cox Regression Basic Terminology
19.4 Cox Regression by using R Language
19.5 Problem Statement
19.6 Cox Regression POJO Solution
19.7 Input for MapReduce
19.8 Cox Regression by MapReduce
19.8.1 Cox Regression PHASE-1: map()
19.8.2 Cox Regression PHASE-1: reduce()
19.8.3 Cox Regression PHASE-2: map()
19.8.4 Sample Output Generated by PHASE-2 reduce()
19.8.5 Sample Output Generated by PHASE-2 map()
19.8.6 Cox Regression by MapReduce: How Does It Work
20 Cochran-Armitage Test for Trend
20.1 Introduction
20.2 Cochran-Armitage Algorithm
20.3 Application of Cochran-Armitage
20.4 MapReduce Solution
20.4.1 Input
20.4.2 Expected Output
20.4.3 Mapper
20.4.4 Reducer
20.5 MapReduce/Hadoop Implementation
20.5.1 Sample Run of MapReduce/Hadoop Implementation
21 Allelic Frequency
21.1 Introduction
21.2 Basic Definitions
21.2.1 Chromosome
21.2.2 Bioset
21.2.3 Allele and Allelic Frequency
21.2.4 Source of Data for Allelic Frequency
21.2.5 Allelic Frequency Analysis by Fisher's Exact Test
21.2.6 Fisher's Exact Test
21.3 Formal Problem Statement
21.4 MapReduce Phase-1
21.4.1 Input
21.4.2 Output/Result
21.4.3 MapReduce Solution for Allelic Frequency
21.4.4 Phase-1 Mapper
21.4.5 Phase-1 Reducer
21.4.6 Sample Run of MapReduce/Hadoop Implementation
21.4.7 Sample Plot of P-values
21.5 MapReduce Phase-2
21.5.1 Phase-2: Mapper for Bottom-100
21.5.2 Phase-2: Reducer for "Bottom 100"
21.6 Is Bottom 100 List A Monoid?
21.6.1 Hadoop Solution for Bottom 100 List
21.6.2 Sample Run of Bottom 100 List
21.7 MapReduce Phase-3
21.7.1 Phase-3: Mapper for Bottom-100
21.7.2 Phase-3: Reducer for "Bottom 100"
21.7.3 Hadoop Solution for Bottom 100 List Per Chromosome
21.7.4 Sample Run of Bottom 100 List Per Chromosome
22 The T-Test
22.1 Introduction
22.2 MapReduce Problem Statement
22.3 Input
22.4 Expected Output
22.5 MapReduce Solution
22.6 Hadoop Implementation
22.7 Spark Implementation
22.7.1 High Level Steps
22.7.2 STEP-0: import required classes and interfaces
22.7.3 Create JavaSparkContext
22.7.4 Create TimeTable Data Structure
22.7.5 Create RDD for All Biosets
22.7.6 STEP-1: handle input parameters
22.7.7 Create time table data structure
22.7.8 STEP-3: create a Spark context object
22.7.9 High Level Steps
22.7.10 STEP-5: create RDD for all biosets
22.7.11 STEP-6: map bioset records into JavaPairRDD(K,V) pairs
22.7.12 STEP-7: group biosets by GENE-ID
22.7.13 STEP-8: perform T-test for every GENE-ID
22.7.14 T-test Algorithm
22.7.15 Input for Spark Program
22.7.16 Spark on YARN Script
22.7.17 Sample Run of Script
22.7.18 Generated Outputs
23 Computing Pearson Correlation
23.1 Pearson Correlation Formula
23.2 Pearson Correlation by Example
23.3 Data Set for Pearson Correlation
23.4 POJO Solution for Pearson Correlation
23.5 MapReduce Solution for Pearson Correlation
23.5.1 map() for Pearson Correlation
23.5.2 reduce() for Pearson Correlation
23.5.3 reduce() for Pearson Correlation
23.6 Hadoop Implementation for Pearson Correlation
23.7 Pearson Correlation using Spark/Hadoop
23.7.1 Input
23.7.2 Output
23.7.3 Spark Solution
23.7.4 Spark Solution: High-Level Steps
23.7.5 STEP-0: import required classes and interfaces
23.7.6 Method smaller()
23.7.7 MutableDouble Class
23.7.8 Method toMap()
23.7.9 Method toListOfString()
23.7.10 Method readBiosets()
23.7.11 STEP-1: handle input parameters
23.7.12 STEP-2: create a Spark context object
23.7.13 STEP-3: create list of input files/biomarkers
23.7.14 STEP-4: broadcast "reference" as global shared object
23.7.15 STEP-5: read all biomarkers from HDFS and create the first RDD
23.7.16 STEP-6: filter biomarkers by reference
23.7.17 STEP-7: create (Gene-ID, (Patient-ID, Gene-Value)) pairs
23.7.18 STEP-8: group by gene
23.7.19 STEP-9: create Cartesian product of all genes
23.7.20 STEP-10: filter redundant pairs of genes
23.7.21 STEP-11: calculate Pearson Correlation and p-value
23.7.22 Pearson Class
23.7.23 Test Pearson Class
23.7.24 Pearson Correlation Using R
23.7.25 YARN Script to Run Spark Program
23.8 Spearman Correlation
23.8.1 Spearman Correlation Wrapper Class
23.8.2 Test Spearman Correlation Wrapper Class
24 DNA Base Count
24.1 Introduction
24.2 FASTA Format
24.2.1 FASTA Format Example
24.3 FASTQ Format
24.3.1 FASTQ Format Example
24.4 MapReduce Solution: FASTA Format
24.4.1 Reading FASTA Files
24.4.2 MapReduce Solution: map()
24.4.3 MapReduce Solution: reduce()
24.5 Hadoop Implementation: FASTA Format
24.5.1 Hadoop Sample Run
24.5.2 What If 1
24.5.3 What If 2
24.6 MapReduce Solution: FASTQ Format
24.6.1 MapReduce Solution: map()
24.6.2 MapReduce Solution: reduce()
24.7 Hadoop Implementation
24.7.1 Sample Run of Hadoop Implementation
24.7.2 Reading FASTQ Files
25 RNA-Sequencing
25.1 Introduction
25.2 Data Size and Format
25.3 MapReduce Solution
25.3.1 Input data validation
25.4 MapReduce Algorithms for RNA-Sequencing
25.4.1 STEP-1: MapReduce TopHat Mapping
25.4.2 STEP-2: MapReduce Calling Cuffdiff
26 Gene Aggregation
26.1 Introduction
26.2 Input
26.3 Output
26.4 MapReduce Solution
26.4.1 Mapper: Filter by Individual
26.4.2 Reducer: Filter by Individual
26.4.3 Mapper: Filter by Average
26.4.4 Reducer: Filter by Average
26.5 Computing Gene Aggregation
26.6 Hadoop Implementation
26.7 Analysis of Output
26.8 Gene Aggregation in Spark
26.9 Gene Aggregation in Spark: Filter by Individual
26.9.1 High Level Solution
26.9.2 High Level Solution
26.9.3 STEP-1: handle input parameters
26.9.4 STEP-2: Create a Spark Context Object
26.9.5 STEP-3: Broadcast Shared Variables
26.9.6 STEP-4: Create a JavaRDD For Biosets
26.9.7 STEP-5: Map Biosets into JavaPairRDD(K,V)
26.9.8 STEP-6: filter out the redundant RDD elements
26.9.9 STEP-7: reduce by Key and sum up the frequency count
26.9.10 STEP-8: prepare the final output
26.9.11 Utility Functions
26.9.12 Running Spark on YARN
26.10 Gene Aggregation in Spark: Filter by Average
26.10.1 STEP-0: import required classes and interfaces
26.10.2 STEP-1: handle input parameters
26.10.3 STEP-2: create a Java Spark context object
26.10.4 STEP-3: share global variables in all cluster nodes
26.10.5 STEP-4: read all bioset records and create an RDD
26.10.6 STEP-5: map bioset records and create JavaPairRDD(K,V)
26.10.7 STEP-6: filter redundant records created by STEP-5
26.10.8 STEP-7: group biosets by geneID and referenceType
26.10.9 STEP-8: prepare the final desired output
26.10.10 STEP-9: emit the final output
26.10.11 toList() Method
26.10.12 readInputFiles() Method
26.10.13 buildPatientsMap() Method
26.10.14 buildPatientsMap() Method
26.10.15 Running Spark on YARN
27 Linear Regression
27.1 Introduction
27.2 Simple Facts about Linear Regression
27.3 Simple Example
27.4 Problem Statement
27.5 Input Data
27.6 Expected Output
27.7 MapReduce Solution using Apache Commons SimpleRegression
27.8 Hadoop Implementation using Apache Commons SimpleRegression
27.9 MapReduce Solution using R's Linear Model
27.9.1 MapReduce Solution using R's Linear Model: Phase 1
27.9.2 MapReduce Solution using R's Linear Model: Phase 2
27.9.3 Hadoop Implementation using R's Linear Model
28 MapReduce and Monoids
28.1 Introduction
28.2 Definition of Monoid
28.2.1 How to form a Monoid?
28.3 Monoidic and Non-Monoidic Examples
28.3.1 Subtraction over Set of Integers
28.3.2 Subtraction over Set of Integers
28.3.3 Addition over Set of Integers
28.3.4 Multiplication over Set of Integers
28.3.5 Mean over Set of Integers
28.3.6 Non-Commutative Example
28.3.7 Median over Set of Integers
28.3.8 Concatenation over Lists
28.3.9 Union/Intersection over Integers
28.3.10 Functional Example
28.3.11 Matrix Example
28.4 MapReduce Example: Not a Monoid
28.5 MapReduce Example: Monoid
28.6 Hadoop Implementation of Monoidized MapReduce
28.7 Sample Run of Monoidized Hadoop/MapReduce
28.7.1 Create Input File (as a SequenceFile)
28.7.2 Create HDFS Input and Output Directories
28.7.3 Copy Input File to HDFS and Verify
28.7.4 Prepare a shell script to run your MapReduce job
28.7.5 Run MapReduce Job
28.7.6 View Hadoop Output
28.8 Conclusion on Using Monoids
28.9 Functors and Monoids
29 The Small Files Problem
29.1 Introduction
29.2 Solution to The Small Files Problem
29.2.1 Input Data
29.3 Solution With SmallFilesConsolidator
29.3.1 Java Source Files
29.3.2 Sample Run
29.4 Solution Without SmallFilesConsolidator
29.4.1 Java Source Files
29.4.2 Sample Run
30 Huge Cache for MapReduce
30.1 Introduction
30.2 Implementation Options
30.3 Formalizing the Cache Problem
30.4 Elegant Scalable Solution
30.5 Implementation of Elegant Scalable Solution
30.5.1 Use of LRU Map
30.5.2 Test LRU Map
30.5.3 Use of MapDB
30.5.4 Test of MapDB: put() and get()
30.6 MapReduce Using LRU-Map-Cache
30.7 CacheManager Definition
30.7.1 CacheManager Initialization
30.7.2 CacheManager Closing
30.7.3 CacheManager Usage
31 Bloom Filter
31.1 Introduction
31.2 A Simple Bloom Filter Example
31.3 Bloom Filter in Guava Library
31.4 Using Bloom Filter in MapReduce
Appendices
A Bioset
A.1 Introduction
B Spark RDDs
B.1 Introduction
B.2 What is a TupleN?
B.3 What is an RDD
B.4 How to Create RDDs
B.5 Create RDDs by Collection Objects
B.6 Collect Elements of an RDD
B.7 Transform RDD into New RDD
B.8 Create RDDs by Reading Files
B.9 Grouping By Key
B.10 Map Values
B.11 Reducing By Key
B.12 Filtering an RDD
B.13 Saving RDD as HDFS Text File
B.14 Saving RDD as HDFS Sequence File
B.15 Reading RDD from HDFS Sequence File
B.16 Counting RDD
B.17 Spark RDD Examples in Scala
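
As a taste of the Appendix B material above (creating RDDs from collection objects, reducing by key, filtering, and collecting), here is a minimal Java sketch. It is not taken from the book; it assumes only a standard Spark Java setup, and the class name RddBasicsDemo and the local[*] master are illustrative choices.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class RddBasicsDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("RddBasicsDemo")
                .setMaster("local[*]"); // local master for a quick demo
            JavaSparkContext sc = new JavaSparkContext(conf);

            // B.5: create a pair RDD from a collection object
            List<Tuple2<String, Integer>> pairs = Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("b", 2),
                new Tuple2<>("a", 3), new Tuple2<>("b", 4));
            JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(pairs);

            // B.11: reduce by key (sum the values for each key)
            JavaPairRDD<String, Integer> sums = rdd.reduceByKey((x, y) -> x + y);

            // B.12: filter the RDD (keep only totals greater than 4)
            JavaPairRDD<String, Integer> filtered = sums.filter(t -> t._2 > 4);

            // B.6: collect the elements back to the driver and print them
            for (Tuple2<String, Integer> t : filtered.collect()) {
                System.out.println(t._1 + " -> " + t._2);
            }
            sc.close();
        }
    }

Run it with spark-submit or from an IDE; with this sample data it prints the single pair "b -> 6", since only key b's total exceeds 4.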


Example Download Link

Data Algorithms: Hadoop/Spark Big Data Processing Techniques
