| High Performance Computing on Vector Systems 2011 | 348 |
|---|---|
| Preface | 5 |
| Contents | 7 |
| Part I: Techniques and Tools for High Performance Systems | 9 |
| Performance and Scalability Analysis of a Chip Multi Vector Processor | 10 |
| 1 Introduction | 11 |
| 2 Chip Multi Vector Processor | 12 |
| 2.1 Structure of a Chip Multi Vector Processor | 12 |
| 2.2 Performance Model of a Chip Multi Vector Processor | 13 |
| 3 Performance Tuning for a Chip Multi Vector Processor | 15 |
| 3.1 Performance Analysis Using the Roofline Model | 15 |
| 3.2 Program Optimization | 16 |
| 3.2.1 Loop Unrolling | 16 |
| 3.2.2 Cache Blocking | 17 |
| 3.2.3 Performance Tuning Strategy Based on the Roofline Model | 17 |
| 4 Performance and Scalability Analysis | 18 |
| 4.1 Methodology | 18 |
| 4.2 Benchmarks | 19 |
| 4.3 Performance Evaluation of CMVP | 20 |
| 4.4 Performance Evaluation of CMVP with Performance Tuning | 22 |
| 5 Conclusions | 25 |
| References | 26 |
| I/O Forwarding for Quiet Clusters | 28 |
| 1 Introduction | 29 |
| 2 Operating System Noise | 30 |
| 2.1 So … Who's the Noisy Neighbour? | 31 |
| 2.2 Impact on Applications | 31 |
| 2.3 Mitigation | 32 |
| 2.3.1 Silence Your System | 32 |
| 2.3.2 Embrace Noise | 33 |
| 2.3.3 Synchronize Noise | 33 |
| 2.3.4 Prioritize | 33 |
| 2.3.5 Travel Light | 33 |
| 3 Measuring Noise | 34 |
| 3.1 Test System | 34 |
| 3.2 Fixed Work Quanta Benchmark | 35 |
| 3.3 Fixed Time Quanta Benchmark | 36 |
| 4 I/O Induced Noise | 36 |
| 5 I/O Forwarding | 38 |
| 5.1 I/O Forwarding Architecture | 39 |
| 5.2 System I/O Interceptors: Libsysio | 40 |
| 5.3 I/O Forwarding Protocol: IOD Driver and Server | 41 |
| 5.4 Communication Framework: Portals | 41 |
| 5.5 Using the I/O Forwarding Framework | 42 |
| 5.6 Noise | 42 |
| 5.7 FUSE Driver | 44 |
| 6 Conclusion | 44 |
| References | 45 |
| A Prototype Implementation of OpenCL for SX Vector Systems | 47 |
| 1 Introduction | 48 |
| 2 OpenCL | 48 |
| 3 OpenCL for SX | 49 |
| 4 Early Evaluation and Discussions | 51 |
| 5 Conclusions | 53 |
| References | 55 |
| Distributed Parallelization of Semantic Web Java Applications by Means of the Message-Passing Interface | 57 |
| 1 Introduction | 57 |
| 2 Use Case Description: Random Indexing | 59 |
| 3 Parallelization Strategy | 60 |
| 4 Realization by Means of MPI | 61 |
| 5 Implementation | 63 |
| 6 Application Performance Evaluation | 64 |
| 7 Performance Tailoring: Hybrid MPI-Java Threads Communication Pattern | 66 |
| 8 Final Discussion and Conclusion | 68 |
| References | 69 |
| HPC Systems at JAIST and Development of Dynamic Loop Monitoring Tools Toward Runtime Parallelization | 71 |
| 1 Introduction | 71 |
| 2 Information Environment and HPC Systems at JAIST | 72 |
| 3 Development of Dynamic Loop Monitoring Tools Toward Runtime Parallelization | 74 |
| 3.1 Background and Objectives of Dynamic Loop Monitoring Tools | 75 |
| 3.2 Parallelism and Loop Nest Structures | 75 |
| 3.3 Loop Nest Detection and Loop-Call Context Tree Generation | 76 |
| 3.4 Evaluation of Our L-CCT Generation | 78 |
| 3.4.1 Experiment | 78 |
| 3.4.2 Results | 78 |
| 3.5 Run-Time Data Dependence Analysis | 80 |
| 3.5.1 Motivations and Strategies | 81 |
| 3.5.2 Details of Our Runtime Data Dependence Analysis | 81 |
| 3.5.3 Preliminary Evaluation of Runtime Data Dependence Analysis | 82 |
| 4 Conclusions | 83 |
| References | 83 |
| Part II: Methods and Technologies for Large-Scale Systems | 85 |
| Tree Based Voxelization of STL Data | 86 |
| 1 Introduction | 86 |
| 2 Octree Overview | 88 |
| 3 Mesh Generation | 89 |
| 3.1 Intersection Algorithm and Tree Generation | 90 |
| 3.2 Flooding | 92 |
| 3.3 Boundary Conditions | 92 |
| 3.4 The File Format | 94 |
| 4 Sample Mesh | 95 |
| 5 Outlook | 96 |
| References | 96 |
| An Adaptable Simulation Framework Based on a Linearized Octree | 98 |
| 1 Introduction and Overall Layout of the Apes Framework | 98 |
| 1.1 Used Technologies | 99 |
| 1.2 Components of the Apes Suite | 99 |
| 1.3 Distributed Computing | 101 |
| 2 Related Work | 101 |
| 3 The Distributed Linearized Octree | 102 |
| 3.1 Implementation of the Element Description | 102 |
| 3.2 Element Properties | 104 |
| 3.3 Acting on the Tree | 106 |
| 4 Configuration of Simulation Runs | 107 |
| 5 Usage in Solvers | 107 |
| 5.1 Ateles | 108 |
| 5.2 Musubi | 109 |
| 6 Outlook | 110 |
| References | 110 |
| High Performance Computing for Analyzing PB-Scale Data in Nuclear Experiments and Simulations | 111 |
| 1 Introduction | 111 |
| 2 Large-Scale Data Integrated Analysis System | 112 |
| 3
|