Today, we announced RStudio on Amazon SageMaker, the first machine learning (ML) integrated development environment (IDE) in the cloud for data scientists working in R. The open-source language R and its rich ecosystem of more than 18,000 packages have been a top choice for statisticians, quant analysts, data scientists, and ML engineers. RStudio on SageMaker makes it easy for data scientists to run statistical analysis, build ML models, and create data science content in a centralized environment for the team without worrying about the compute infrastructure.
Alongside RStudio Workbench, the RStudio suite for R developers includes RStudio Connect and RStudio Package Manager. RStudio Connect makes it easy to surface ML and data science insights from data scientists’ complex work and put them in the hands of decision-makers. RStudio Connect is designed to allow data scientists to publish insights, dashboards, and web applications. RStudio Connect also makes hosting and managing content simple and scalable for wide consumption.
RStudio Package Manager helps organize and centralize R packages across ML teams and organizations. As data scientists develop their ML models, they need various packages with different capabilities for their ML use cases in RStudio. For enterprise users, manually managing the sources and versions of these packages across numerous public repositories is error-prone and time-consuming. RStudio Package Manager mitigates these issues by managing the package repository centrally for your organization so that data scientists can install packages quickly and securely, and ensure project reproducibility and repeatability. Security and reproducibility are the most important aspects in regulated industries such as healthcare and finance.
In this post, we first show you how to architect and deploy RStudio Connect and RStudio Package Manager with a well-architected solution in AWS. We then show you how to use RStudio Connect and RStudio Package Manager from RStudio on SageMaker. We use a UCI breast cancer dataset to build out several types of ML content in R in RStudio on SageMaker. The ML content we demonstrate in the post includes R Markdown and an R Shiny application.
The solution architecture is based on professional versions of RStudio Connect and RStudio Package Manager Docker containers. RStudio Connect and RStudio Package Manager are configured across two Availability Zones for high availability. Both the RStudio Connect and RStudio Package Manager containers support automatic scaling to handle incoming traffic, based on the number of incoming requests and on memory and CPU usage within the containers.
Container images are stored in and fetched from Amazon Elastic Container Registry (Amazon ECR) with vulnerability scanning enabled. Vulnerability issues should be addressed before deploying the images.
The following diagram illustrates the solution architecture.
The following are the steps in the solution workflow:
We use AWS Cloud Development Kit (AWS CDK) for Python to develop the infrastructure code and store the code in an AWS CodeCommit repository, so that AWS CodePipeline can integrate the AWS CDK stacks for automated builds.
The deployment code uses Route 53 public hosted zones to serve RStudio Connect and RStudio Package Manager on publicly accessible URLs. Alternatively, you can use Route 53 private hosted zones for the RStudio Connect and RStudio Package Manager containers with an internal ALB, which provides private endpoints for users coming from RStudio on SageMaker in VPC-only connectivity mode. This means you don’t need a preexisting public domain in your AWS account. However, you need to fetch the public Docker images (RStudio Connect, RStudio Package Manager), store them in a private Amazon ECR repository, and point the deployment code to those images for the infrastructure build.
If all communications between AWS services must stay within AWS, you can use AWS PrivateLink to configure VPC endpoints for AWS services. AWS PrivateLink makes sure that inter-service traffic is not exposed to the internet for AWS service endpoints.
You can also refer to the RStudio Team solution from RStudio to learn how to deploy an RStudio technology stack on Amazon EC2 in AWS as an alternative to the solution discussed in this post.
To deploy the AWS CDK stacks from the source code, you need to review and perform the prerequisites described in the accompanying GitHub repository to make sure you have the necessary resources to proceed.
The pipeline name is RSC-RSPM-App-Pipeline-
RStudio Package Manager helps with enabling consistency and standardization of R packages across an organization. In RStudio Package Manager, an IT administrator can include an approved package in the repository. Multiple groups can be created to have access to different packages or package versions. RStudio Package Manager also handles all the updating and versioning of the packages. The administrator can enable automatic updates to the packages, or can also configure RStudio Package Manager in a way that the packages can only be updated manually, which provides more isolation between RStudio Package Manager and the CRAN service.
We can create a repository that pulls packages from RStudio’s CRAN source by using the following commands. To run these commands, we need to open a shell session in the RStudio Package Manager container using Amazon ECS Exec.
    # Initiate a sync
    rspm sync --wait

    # Create a repository
    rspm create repo --name=dev-cran --description='Access CRAN packages'

    # Subscribe the repository to the cran source
    rspm subscribe --repo=dev-cran --source=cran
The commands create a repository and subscribe it to the built-in source named cran. When this is complete, the dev-cran repository is available in the web interface of RStudio Package Manager, as shown in the following screenshot. This web interface is accessible by the administrator as well as the users who have the URL for it.
In addition to serving CRAN packages, repositories can be created to distribute local packages, Git packages, local packages along with CRAN packages, a subset of approved CRAN and local packages, and bleeding edge packages from GitHub. For further details on how to create repositories, see Serving CRAN Packages. In addition, RStudio Package Manager supports Bioconductor. Bioconductor is a commonly used ecosystem of R packages in life sciences. We can combine Bioconductor packages with CRAN as well as local packages in RStudio Package Manager.
In the web interface of RStudio Package Manager, on the Setup tab, you can choose a repository by date in a calendar view. You can also choose whether to use the latest version of the packages, or freeze the packages to a particular snapshot, as shown in the following screenshot.
On the Setup tab, we can also see what system prerequisites might be needed for the repository’s packages, along with the commands to install them.
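As a rough illustration of the difference between tracking the latest packages and freezing to a snapshot, the following R sketch builds a repository URL for either case. The helper name `rspm_repo_url`, the server address, and the `<server>/<repo>/<snapshot>` URL layout are assumptions made for illustration. The Setup tab of your own server shows the authoritative URL to copy.

```r
# Hypothetical helper: build a repository URL for RStudio Package Manager.
# The URL layout (<server>/<repo>/<snapshot>) is an assumption; the Setup tab
# of your server shows the exact URL to use.
rspm_repo_url <- function(server, repo, snapshot = "latest") {
  is_date <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", snapshot)
  if (!(identical(snapshot, "latest") || is_date)) {
    stop("snapshot must be 'latest' or a date in YYYY-MM-DD format")
  }
  # Drop any trailing slash so the parts join cleanly
  paste(sub("/+$", "", server), repo, snapshot, sep = "/")
}

# Placeholder server address (not a real endpoint)
server <- "https://rspm.example.com"

# Track the latest packages, or freeze to a dated snapshot
latest_url <- rspm_repo_url(server, "dev-cran")
frozen_url <- rspm_repo_url(server, "dev-cran", snapshot = "2021-10-29")

print(latest_url)
print(frozen_url)
```

Pinning a project to a frozen URL means every collaborator resolves the same package versions, which is the reproducibility benefit described above.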
When creating a SageMaker domain with RStudio, you have an option to set a default RStudio Connect server and RStudio Package Manager repository for all users in your SageMaker domain. During the SageMaker domain creation process, as detailed in the Create a SageMaker domain with RStudio section in Getting Started with RStudio on Amazon SageMaker, you can configure default RStudio Connect and RStudio Package Manager URLs for all user profiles in Step 3: RStudio settings. For RStudio Connect, enter the RStudio Connect server URL. For RStudio Package Manager, enter a CRAN or a Bioconductor repository.
The default URLs are configured and saved in /etc/rstudio/rsession.conf for all users on RStudio on SageMaker. You can verify the default repository in the R console with options('repos'). You should see a repository pointing to your RStudio Package Manager. The default RStudio Connect URL is populated automatically when you publish a piece of R content with one click.
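The check can also be scripted. The following is a minimal sketch that assumes a placeholder server address; `uses_package_manager` is a hypothetical helper name, not part of any RStudio package.

```r
# Hypothetical helper: TRUE if any configured repository URL starts with the
# given RStudio Package Manager address.
uses_package_manager <- function(repos, server) {
  any(startsWith(as.character(unlist(repos)), server))
}

# Inspect the repositories configured for the current R session
repos <- getOption("repos")
print(repos)

# Placeholder address; replace with your own RStudio Package Manager URL
rspm_server <- "https://rspm.example.com"
print(uses_package_manager(repos, rspm_server))
```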
If you already have a working RStudio on SageMaker and want to use a different repository, you can configure your R session in RStudio on SageMaker to use a repository from your RStudio Package Manager with the following steps:
Now, the packages that we install in RStudio are sourced from the selected repository on your RStudio Package Manager server. You can verify this with options('repos') or by installing a package and seeing where it is pulled from. For more details, see Checking For Success.
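The same configuration can be applied from the R console. The sketch below assumes a placeholder repository URL (copy the real one from the Setup tab) and uses `contrib.url()` from the base `utils` package to show where source packages would be downloaded from, without touching the network.

```r
# Placeholder repository URL; copy the real one from the Setup tab
rspm_repo <- "https://rspm.example.com/dev-cran/latest"

# Point the current R session at the RStudio Package Manager repository
options(repos = c(RSPM = rspm_repo))

# contrib.url() shows where source packages would be downloaded from
source_url <- utils::contrib.url(getOption("repos"), type = "source")
print(source_url)

# install.packages("mlbench")  # uncomment to install from the repository
```

This setting lasts for the current session only. Adding the same `options()` call to a `.Rprofile` file is one common way to make it persistent.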
If you already have a working RStudio on SageMaker and want to use a different RStudio Connect server than the default, complete the following steps:
If this is your first time connecting, a new page appears and asks you to log in with an account.
You should see your RStudio Connect user profile and server URL in the list.
For more information, see Connect your RStudio Account, and Connecting: RStudio IDE.
Now the RStudio Connect server is successfully connected to RStudio on Amazon SageMaker. We’re ready to build some great content and publish it.
You can easily create an analysis within RStudio on Amazon SageMaker and push-button publish it to your RStudio Connect server so that your collaborators can consume your analysis. For this post, we use a UCI breast cancer dataset from mlbench to walk through two common publication use cases: R Markdown documents and Shiny apps.
R Markdown is a great tool to run your analyses in R as part of a markdown file and share them in RStudio Connect. In rsconnect_rmarkdown/breast_cancer_eda.Rmd, we run two simple analyses with plots on the dataset, alongside text in markdown:
```{r breastcancer}
# assumes mlbench and ggplot2 were loaded in an earlier chunk
data(BreastCancer)
df <- BreastCancer
# convert input values to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}
summary(df)
```

```{r cl_thickness, echo=FALSE}
ggplot(df, aes(x=Cl.thickness))+
  geom_histogram(color="black", fill="white", binwidth = 1)+
  facet_grid(Class ~ .)
```
We can preview the file by choosing Knit and publish it to RStudio Connect by choosing Publish.
Besides R Markdown, more often than not, you’re building an interactive application or dashboard with Shiny. Let’s look at how we can publish Shiny apps from RStudio on Amazon SageMaker to RStudio Connect.
Shiny is an R package that makes it easy to create interactive web applications programmatically. Data scientists often use a Shiny application to share their analyses and models with stakeholders. In rsconnect_shiny/breast-cancer-app/, we develop an ML model in breast_cancer_modeling.r and create a web application to allow users to interact with the data and ML model.
To publish, open app.R and choose Publish. Select both app.R and breast_cancer_modeling.r to publish.
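As an alternative to the Publish button, the same deployment can be sketched from the R console with `rsconnect::deployApp()`. This sketch assumes that the rsconnect package is installed, that an RStudio Connect account is already connected in the session, and that the working directory is the repository root. The wrapper name `publish_breast_cancer_app` and the content name are hypothetical.

```r
# Sketch: publish the Shiny app from the R console instead of the Publish button.
# Assumes the rsconnect package is installed and an RStudio Connect account
# has already been connected in this RStudio session.
publish_breast_cancer_app <- function(app_dir = "rsconnect_shiny/breast-cancer-app",
                                      app_name = "breast-cancer-app") {
  if (!requireNamespace("rsconnect", quietly = TRUE)) {
    stop("The rsconnect package is required to publish")
  }
  rsconnect::deployApp(
    appDir   = app_dir,
    # Bundle both files named in this post
    appFiles = c("app.R", "breast_cancer_modeling.r"),
    # Hypothetical content name shown in RStudio Connect
    appName  = app_name
  )
}

# Run from the repository root once your Connect account is set up:
# publish_breast_cancer_app()
```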
In the application, you can change the two features shown in the plot and select data points in the plot to see the actual data and the model’s predictions of whether they are benign or malignant cases. By sliding the probability threshold, you can interact with the model and get different classification counts. You can see the dashboard in action in the following screenshot.
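To make the effect of the threshold slider concrete, here is a minimal base R sketch. The probabilities are made-up illustrative values, not model output, and `classify_by_threshold` is a hypothetical helper rather than code from the application.

```r
# Hypothetical helper: turn predicted malignancy probabilities into class
# counts for a given probability threshold.
classify_by_threshold <- function(prob_malignant, threshold = 0.5) {
  stopifnot(threshold >= 0, threshold <= 1)
  labels <- ifelse(prob_malignant >= threshold, "malignant", "benign")
  # Fix the factor levels so both classes always appear in the counts
  table(factor(labels, levels = c("benign", "malignant")))
}

# Made-up probabilities for illustration only
probs <- c(0.05, 0.20, 0.45, 0.55, 0.80, 0.95)

counts_default <- classify_by_threshold(probs, threshold = 0.5)
counts_strict  <- classify_by_threshold(probs, threshold = 0.9)

print(counts_default)
print(counts_strict)
```

Raising the threshold moves borderline cases from the malignant class to the benign class, which is why the counts change as you move the slider.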
In this post, we showed you how to deploy RStudio Connect and RStudio Package Manager servers in AWS with an architecture based on AWS Fargate and Amazon ECS, using AWS CDK. With RStudio Connect and RStudio Package Manager running in the cloud, we showed you how to use them from RStudio on Amazon SageMaker. Then we demonstrated how to deploy R-based materials such as R Markdown and Shiny applications to the RStudio Connect instance based on a breast cancer prediction use case.
Having an RStudio Connect instance in the cloud not only enables your ML and data science teams to collaborate more effectively, but also makes sharing ML insights across stakeholders and business units much easier. This in turn promotes the use of ML in your organization for a better business outcome. With RStudio Package Manager, you can quickly and securely manage, serve, and install R packages from trusted sources to ensure project reproducibility.
You can learn more about RStudio on SageMaker from a data scientist’s perspective in the post Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists. You can also learn more about how to set up and administer RStudio on SageMaker in the post Getting started with RStudio on Amazon SageMaker. To learn more about Amazon SageMaker Studio, the first IDE for ML in the cloud, see Amazon SageMaker Studio.
Michael Hsieh is a Senior AI/ML Specialist Solutions Architect. He works with customers to advance their ML journey with a combination of Amazon Machine Learning offerings and his ML domain knowledge. As a Seattle transplant, he loves exploring the great outdoors the region has to offer, such as the hiking trails, scenic kayaking in SLU, and the sunset at Shilshole Bay.
Chayan Panda is a Cloud Infrastructure Architect. He provides advisory services and thought leadership to AWS customers on robust solution design for cloud migrations, cloud infrastructure (security, network, DevOps), Greenfield platform implementations, big data/AI/ML, and serverless and database solutions. When he is not obsessing about customers, he enjoys a short run, music, a book, or travel with his family.
Farooq Sabir is a Senior AI/ML Specialist Solutions Architect. He helps customers solve their business problems using data science, machine learning, and artificial intelligence.