
PySpark CLI—An Efficient Way to Manage Your PySpark Projects

Jino Jossy, Mehul Agarwal


In the world of big data analytics, PySpark, the Python API for Apache Spark, has gained a lot of traction because it enables rapid development. Apart from Python, Spark provides high-level APIs in Java, Scala, and R. Despite the simplicity of the Python interface, getting a PySpark project up and running involves long commands. Take, for example, the command to submit a job from a typical project layout:

$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --packages 'com.somesparkjar.dependency:1.0.0' \
  --py-files packages.zip \
  --files configs/etl_config.json \
  jobs/etl_job.py

This is not the most convenient or intuitive way to work with what is, at heart, a simple file structure.
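To see what the long command is made of, here is a minimal Python sketch that assembles the same spark-submit argument list from named parameters. The helper name build_spark_submit and the /opt/spark location are assumptions made for illustration; only the flags (--master, --packages, --py-files, --files) come from the command above. It builds the command without running it.

```python
import os
import shlex


def build_spark_submit(job, master="local[*]", packages=(), py_files=(),
                       files=(), spark_home=None):
    """Return the spark-submit argument list for a single PySpark job."""
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    # Fall back to a bare "spark-submit" (resolved via PATH) if SPARK_HOME is unset.
    if spark_home:
        executable = os.path.join(spark_home, "bin", "spark-submit")
    else:
        executable = "spark-submit"
    cmd = [executable, "--master", master]
    # Each of these flags takes a single comma-separated value.
    if packages:
        cmd += ["--packages", ",".join(packages)]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    if files:
        cmd += ["--files", ",".join(files)]
    cmd.append(job)
    return cmd


cmd = build_spark_submit(
    "jobs/etl_job.py",
    packages=["com.somesparkjar.dependency:1.0.0"],
    py_files=["packages.zip"],
    files=["configs/etl_config.json"],
    spark_home="/opt/spark",  # hypothetical install location
)
# shlex.join quotes arguments such as local[*] so the line is safe to paste in a shell.
print(shlex.join(cmd))
```

Every option is one more thing to remember and type, which is the overhead a project-level tool can hide behind defaults.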

So is there an easy way to get started with PySpark?

Introducing PySpark CLI, a tool to create and manage end-to-end PySpark projects. With sensible defaults, it helps new users create projects with short commands. Experienced users can use PySpark CLI to manage their PySpark projects more efficiently.

PySpark CLI generates the project folder structure along with the required configuration files and boilerplate code with which you can dive right into your project. The folder structure is designed for easy understanding and customization so you can make changes suited to the project you’re working on. Even as is, the folder structure is suitable for projects covering various applications. 

We have a video tutorial that shows how you can start your project with PySpark CLI. If you are new to Python and PySpark, watch our Quick Intro to PySpark.

Writing Your First PySpark App

Let’s learn by example. We assume you have PySpark installed already. You can check whether PySpark is installed by running the following command in a shell prompt (indicated by the $ prefix):

$ pyspark

If PySpark is installed, the PySpark shell starts and shows the version of your installation. If it isn’t, you’ll get an error. This tutorial is written for Spark 2.4.4, which supports Python 2.7.15 and later versions.
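The same check can also be made from Python without starting the shell. The sketch below is only one possible approach: it looks for the pyspark launcher on PATH and for a pyspark package visible to the current interpreter. The function name pyspark_status is an assumption made for this example.

```python
import importlib.util
import shutil


def pyspark_status():
    """Report whether the pyspark launcher and Python package can be found."""
    return {
        # Path to the `pyspark` launcher script if it is on PATH, else None.
        "launcher": shutil.which("pyspark"),
        # True when a pyspark package is discoverable by this interpreter.
        "package": importlib.util.find_spec("pyspark") is not None,
    }


status = pyspark_status()
print(status)
```

A launcher without a package (or the reverse) usually means Spark was installed in one environment and Python is running in another.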

Environment Setup

Let’s set up the environment required for working with PySpark projects. For installation on Ubuntu, follow these steps:

  1. Download and install JDK 8, the Java version that Spark 2.4.4 runs on. Before you can start with Spark and Hadoop, you need to make sure you have Java 8 installed. You can check this by running the following command: java -version
  2. Download the Apache Spark distribution (this tutorial uses spark-2.4.4-bin-hadoop2.7.tgz).
  3. Create a directory named “spark” in your home directory with the following command: mkdir spark
  4. Move spark-2.4.4-bin-hadoop2.7.tgz into the spark directory and extract it:
mv ~/Downloads/spark-2.4.4-bin-hadoop2.7.tgz spark
cd spark/
tar -xzvf spark-2.4.4-bin-hadoop2.7.tgz

After extracting the archive, go to the bin directory of Spark and run ./pyspark. This opens the PySpark shell.

Configure Apache Spark. Next, add Spark to your PATH so that it can be executed from any directory.

  1. Open the bash_profile file: vi ~/.bash_profile
  2. Add the following entries:
export SPARK_HOME=~/spark/spark-2.4.4-bin-hadoop2.7/
export PATH="$SPARK_HOME/bin:$PATH"
  3. Run the following command to update the PATH variable in the current session: source ~/.bash_profile
  4. After your next login, the pyspark command will be on your PATH and can be run from any directory.
  5. Check the PySpark installation: in your Anaconda prompt, or any terminal where Python is available, type pyspark to enter the PySpark shell. It is best to check this in the Python environment from which you run Jupyter Notebook. You should see the PySpark shell start up.
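If the pyspark command is still not found after these steps, a quick consistency check of the two variables can help. The following sketch assumes nothing beyond SPARK_HOME and PATH as set above; the function name check_spark_env is hypothetical, and it only compares strings rather than verifying that the directory exists.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in SPARK_HOME / PATH (empty if none)."""
    env = os.environ if env is None else env
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    # Normalize so that a trailing slash or "~" does not cause a false mismatch.
    spark_bin = os.path.normpath(
        os.path.join(os.path.expanduser(spark_home), "bin"))
    path_dirs = [os.path.normpath(os.path.expanduser(p))
                 for p in env.get("PATH", "").split(os.pathsep) if p]
    if spark_bin not in path_dirs:
        problems.append(f"{spark_bin} is not on PATH")
    return problems


print(check_spark_env() or "Spark environment looks consistent")
```

Passing a dictionary instead of relying on os.environ makes it easy to try out a planned configuration before editing the profile file.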

PySpark Shell

Test the installation using the following commands. The output should be [1, 4, 9, 16].

$ pyspark
>>> nums = sc.parallelize([1,2,3,4])
>>> nums.map(lambda x: x*x).collect()

To exit the PySpark shell, press Ctrl+D or use the Python command exit(). (On Linux, Ctrl+Z only suspends the shell; it does not exit it.)
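The shell creates the sc object for you. In a standalone script you have to create and stop the SparkContext yourself. Here is a minimal sketch of the same squares check as a script; the function names squares and run_check and the application name are assumptions made for this example, and run_check needs a working PySpark and Java installation.

```python
def squares(rdd):
    """Square every element of an RDD (or any object with map/collect)."""
    return rdd.map(lambda x: x * x).collect()


def run_check():
    """Start a local SparkContext, run the check, and always stop the context."""
    from pyspark import SparkContext  # imported lazily: needs PySpark and Java

    sc = SparkContext(master="local[*]", appName="squares-check")
    try:
        return squares(sc.parallelize([1, 2, 3, 4]))
    finally:
        sc.stop()
```

With Spark installed, calling print(run_check()) from a script should print the same list as the shell session above.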

Installing PySpark CLI

Using Source

git clone https://github.com/qburst/PySparkCLI.git

cd PySparkCLI

pip3 install -e . --user

Using PyPI

pip3 install pyspark-cli

Commands

1. Create a new project: pysparkcli create [project-name]

Run the following command to create a project named sample:

pysparkcli create sample -m local[*] -c 2 -t default

master: the URL of the cluster to connect to. You can also use -m instead of --master.

project_type: the type of project you want to create, such as default or streaming. You can also use -t instead of --project_type.

cores: the number of parallel tasks an executor can run. You can also use -c instead of --cores.

You’ll see the following in your command line:

Completed building project: sample
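To make the short and long option forms concrete, here is a small argparse sketch that accepts the same three options. It is an illustration only, not PySpark CLI's own parser: the program name and the default values are assumptions, and only the option names and their meanings are taken from the descriptions above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented create options."""
    parser = argparse.ArgumentParser(prog="create-sketch")
    parser.add_argument("project_name")
    # Default values below are assumptions chosen for this sketch.
    parser.add_argument("-m", "--master", default="local[*]",
                        help="URL of the cluster to connect to")
    parser.add_argument("-c", "--cores", type=int, default=2,
                        help="number of parallel tasks an executor can run")
    parser.add_argument("-t", "--project_type", default="default",
                        help="type of project, such as default or streaming")
    return parser


args = build_parser().parse_args(
    ["sample", "-m", "local[*]", "-c", "2", "-t", "default"])
print(args.project_name, args.master, args.cores, args.project_type)
```

Parsing ["sample", "--master", "local[*]", "--cores", "2"] with the same parser gives identical values, which is all the short forms are: aliases for the long ones.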

2. Run the project by path: pysparkcli run [project-path] 

Run the following commands to create a virtual environment and run your project sample:

virtualenv --python=/usr/bin/python3.7 sample.env
source sample.env/bin/activate
pysparkcli run sample

You’ll see the following in your command line:

Started running project: sample/
19/11/25 10:37:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Hello World

3. Initiate the stream: pysparkcli stream [project-path] [stream-file-path]

Run the following command to stream data for the project sample using the twitter_stream file:

pysparkcli stream sample twitter_stream

You’ll see output similar to the following in your command line (this capture is from a project named test):

(streaming_project_env) ➜  checking git:(docs_develop) ✗ pysparkcli stream test twitter_stream
Started streaming of project: test
Requirement already satisfied: certifi==2019.11.28 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 1))
Requirement already satisfied: chardet==3.0.4 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 2))
Requirement already satisfied: idna==2.8 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 3))
Requirement already satisfied: oauthlib==3.1.0 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 4))
Requirement already satisfied: py4j==0.10.7 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 5))
Requirement already satisfied: PySocks==1.7.1 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 6))
Requirement already satisfied: pyspark==2.4.4 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 7))
Requirement already satisfied: requests==2.22.0 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 8))
Requirement already satisfied: requests-oauthlib==1.3.0 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 9))
Requirement already satisfied: six==1.13.0 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 10))
Requirement already satisfied: tweepy==3.8.0 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 11))
Requirement already satisfied: urllib3==1.25.7 in ./streaming_project_env/lib/python3.6/site-packages (from -r test/requirements.txt (line 12))
Listening on port: 5555
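The final line, "Listening on port: 5555", suggests that the stream file opens a TCP socket that a consumer then reads from. How PySpark CLI wires this up internally is not shown here, so the following is only a plain-Python sketch of that producer and consumer pattern under that assumption. The function names are hypothetical, and the sketch binds to an ephemeral port rather than 5555 so it can run anywhere.

```python
import socket
import threading


def serve_lines(lines, host="127.0.0.1", port=0):
    """Listen on a TCP port and send each line to the first client that connects."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))  # port=0 lets the OS pick a free port
    server.listen(1)
    bound_port = server.getsockname()[1]

    def handle():
        conn, _ = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return bound_port, thread


def read_lines(port, host="127.0.0.1"):
    """Connect to the producer and read newline-delimited text until it closes."""
    with socket.create_connection((host, port)) as client:
        with client.makefile("r", encoding="utf-8") as stream:
            return [line.rstrip("\n") for line in stream]


port, producer = serve_lines(["first message", "second message"])
print(f"Listening on port: {port}")
received = read_lines(port)
producer.join()
print(received)
```

In a Spark Streaming job, the consumer side would typically be a socket source such as StreamingContext.socketTextStream(host, port) instead of read_lines.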

4. Run the test by path: pysparkcli test [project-path]

Run the following command to run all tests for your project sample:

pysparkcli test sample

You’ll see the following in your command line:

Started running test cases for project: sample
19/12/09 14:02:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/usr/lib/python3.7/socket.py:660: ResourceWarning: unclosed
  self._sock = None
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib/python3.7/socket.py:660: ResourceWarning: unclosed
  self._sock = None
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
Ran 1 test in 6.041s
OK
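The "Ran 1 test ... OK" summary is the format printed by Python's unittest runner. As a rough idea of what a project test can look like, here is a self-contained sketch. The word_lengths function and the test class are hypothetical and are not files generated by PySpark CLI; the logic under test is kept in a plain function so it can be checked without starting Spark.

```python
import unittest


def word_lengths(words):
    """Hypothetical transformation under test: map each word to its length."""
    return [len(word) for word in words]


class WordLengthsTest(unittest.TestCase):
    def test_lengths(self):
        self.assertEqual(word_lengths(["spark", "cli"]), [5, 3])


# Run the test case programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordLengthsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping transformations in plain functions like this means most tests stay fast, with only a few needing a real SparkContext.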

5. Check the version: pysparkcli version

Conclusion

Even though PySpark CLI can create and manage projects, there are more possibilities to be explored. The following are a few that we think would help the project at the current stage:

  • Custom integration for different databases during the project creation itself.
  • Accepting project details from an input file, such as a YAML or JSON file, instead of command-line arguments.
  • Generate dynamic tests for projects.
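To illustrate the file-based input idea, here is a sketch of how a JSON project description could be translated into the existing create command. This feature does not exist in PySpark CLI as described here; the config keys (name, master, cores, project_type) and the function name are assumptions, and only the -m, -c, and -t flags come from the Commands section.

```python
import json


def create_command_from_config(text):
    """Translate a JSON project description into `pysparkcli create` arguments."""
    config = json.loads(text)
    cmd = ["pysparkcli", "create", config["name"]]
    # Map hypothetical config keys onto the short flags of the create command.
    for key, flag in (("master", "-m"), ("cores", "-c"), ("project_type", "-t")):
        if key in config:
            cmd += [flag, str(config[key])]
    return cmd


example = '{"name": "sample", "master": "local[*]", "cores": 2, "project_type": "default"}'
print(create_command_from_config(example))
```

Keys that are missing from the file are simply left out, so the tool's own defaults would still apply.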

Learning is a never-ending process, and we hope PySpark CLI helps you start that project you always wanted to but never did.

Do you have some interesting features to suggest or implement? Please raise them on the project's GitHub repository (https://github.com/qburst/PySparkCLI). Also, check out our contribution guidelines.

 
