A Spark distributed system architecture can be deployed through virtualisation. In addition to being quick to deploy, such an architecture makes effective use of a computer’s hardware capacity and allows hardware resources to be allocated flexibly, which reduces hardware costs. This study used virtual machine software to deploy a Spark distributed system, with the Hadoop Distributed File System (HDFS) used for data access. The performance of Spark's in-memory computing framework, which is based on resilient distributed datasets (RDDs), was then analysed. Two methods, secondary sorting and WordCount combined with Top-K, were employed to test performance on a data volume of 300 GB. The two methods were then cross-validated, and the CPUs, memory, and computing nodes of the system were adjusted in each experimental phase to determine the optimal hardware configuration. The experimental results verified that using more nodes resulted in faster data analysis in a Spark distributed system. However, when a small data volume such as 30 GB was processed and the hardware resources of each node were sufficient, data analysis performance could not be improved further once it had reached a certain threshold.
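
For readers unfamiliar with the two benchmark workloads, the following is a minimal single-machine sketch in plain Python, not the Spark code used in the study. It only illustrates the logic of each method: WordCount combined with Top-K counts word occurrences and keeps the K most frequent, and secondary sorting orders records by key and then by value within each key. The function names `word_count_top_k` and `secondary_sort` and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter
import heapq


def word_count_top_k(lines, k):
    """Count whitespace-separated words and return the k most frequent.

    Hypothetical helper: ties are broken alphabetically, which is an
    assumption of this sketch rather than something stated in the study.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort key: higher count first, then alphabetical order of the word.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


def secondary_sort(records):
    """Sort (key, value) pairs by key, then by value within each key."""
    return sorted(records, key=lambda kv: (kv[0], kv[1]))


if __name__ == "__main__":
    sample_lines = ["a b a", "c b a"]
    print(word_count_top_k(sample_lines, 2))
    print(secondary_sort([("x", 3), ("a", 2), ("x", 1), ("a", 5)]))
```

In a distributed setting the same steps would be spread across partitions, with per-partition counts or sorted runs merged afterwards; the sketch omits that entirely.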
