Every data scientist I know spends a lot of time handling data that originates in CSV files. You can quickly end up with a mess of CSV files located in your Documents, Downloads, Desktop, and other random folders on your hard drive.

I greatly simplified my workflow the moment I started organizing all my CSV files in my Cloud account. Now I always know where my files are and I can read them directly from the Cloud using JupyterLab (the new Jupyter UI) or my Python scripts.

This article will teach you how to read your CSV files hosted on the Cloud in Python as well as how to write files to that same Cloud account.

I’ll use IBM Cloud Object Storage, an affordable, reliable, and secure Cloud storage solution. (Since I work at IBM, I’ll also let you in on a secret of how to get 10 Terabytes for a whole year, entirely for free.) This article will help you get started with IBM Cloud Object Storage and make the most of this offer. It is composed of three parts:

  1. How to use IBM Cloud Object Storage to store your files;
  2. Reading CSV files in Python from Object Storage;
  3. Writing CSV files to Object Storage (also in Python of course).

The best way to follow along with this article is to go through the accompanying Jupyter notebook, either on Cognitive Class Labs (our free JupyterLab Cloud environment) or by downloading the notebook from GitHub and running it yourself. If you opt for Cognitive Class Labs, once you sign in, you will be able to select the IBM Cloud Object Storage Tutorial as shown in the image below.

[Image: IBM Cloud Object Storage Tutorial]

 

What is Object Storage and why should you use it?

The “Storage” part of object storage is pretty straightforward, but what exactly is an object and why would you want to store one? An object is basically any conceivable data. It could be a text file, a song, or a picture. For the purposes of this tutorial, our objects will all be CSV files.

Unlike a typical filesystem (like the one used by the device you’re reading this article on) where files are grouped in hierarchies of directories/folders, object storage has a flat structure. All objects are stored in groups called buckets. This structure allows for better performance, massive scalability, and cost-effectiveness.

By the end of this article, you will know how to store your files on IBM Cloud Object Storage and easily access them using Python.

 

Provisioning an Object Storage Instance on IBM Cloud

Sign up or log in with your IBM Cloud account here (it’s free) to begin provisioning your Object Storage instance. Feel free to use the Lite plan, which is free and allows you to store up to 25 GB per month. You can customize the Service Name if you wish, or just leave it as the default. You can also leave the resource group set to its default. Resource groups are useful for organizing your resources on IBM Cloud, particularly when you have many of them running. When you’re ready, click the Create button to finish provisioning your Object Storage instance.

[Image: Creating an Object Storage instance]

Working with Buckets

Since you just created the instance, you’ll now be presented with options to create a bucket. You can always find your Object Storage instance by selecting it from your IBM Cloud Dashboard.

There’s a limit of 100 buckets per Object Storage instance, but each bucket can hold billions of objects. In practice, how many buckets you need will be dictated by your availability and resilience needs.

For the purposes of this tutorial, a single bucket will do just fine.

Creating your First Bucket

Click the Create Bucket button and you’ll be shown a window like the one below, where you can customize some details of your bucket. All these options may seem overwhelming at first, but don’t worry, we’ll explain them shortly. They are part of what makes this service so customizable, should you have the need later on.

[Image: Creating an Object Storage bucket]

If you don’t care about the nuances of bucket configuration, you can type in any unique name you like and press the Create button, leaving all other options to their defaults. You can then skip to the Putting Objects in Buckets section below. If you would like to learn about what these options mean, read on.

Configuring your bucket

Resiliency Options

| Resiliency Option | Description | Characteristics |
| --- | --- | --- |
| Cross Region | Your data is stored across three geographic regions within your selected location | High availability and very high durability |
| Regional | Your data is stored across three different data centers within a single geographic region | High availability and durability, very low latency for regional users |
| Single Data Center | Your data is stored across multiple devices within a single data center | Data locality |

Storage Class Options

| Frequency of Data Access | IBM Cloud Object Storage Class |
| --- | --- |
| Continual | Standard |
| Weekly or monthly | Vault |
| Less than once a month | Cold Vault |
| Unpredictable | Flex |

Feel free to experiment with different configurations, but I recommend choosing “Standard” for your storage class for this tutorial’s purposes. Any resilience option will do.

After you’ve created your bucket, store the name of the bucket into the Python variable below (replace cc-tutorial with the name of your bucket) either in your Jupyter notebook or a Python script.
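The variable can be as simple as the line below. The name bucket_name is an assumption used for illustration here, and cc-tutorial is only the example bucket name from this tutorial.

```python
# Name of the bucket created above.
# 'cc-tutorial' is the tutorial's example name; replace it with your own bucket name.
bucket_name = 'cc-tutorial'
```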

Creating Service Credentials

To access your IBM Cloud Object Storage instance from anywhere other than the web interface, you will need to create credentials. Click the New credential button under the Service credentials section to get started.

In the next window, select Manager as your role, and add {"HMAC":true} to the Add Inline Configuration Parameters (Optional) field. You can leave all other fields as their defaults and click the Add button to continue.

You’ll now be able to click on View credentials to obtain the JSON object containing the credentials you just created. You’ll want to store everything you see in a credentials variable like the one below (obviously, replace the placeholder values with your own). Take special note of your access_key_id and secret_access_key which you will need for the Cyberduck section below.
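As a rough sketch, the credentials variable could look like the following. Every value is a placeholder, and the key names follow the layout HMAC-enabled service credentials usually have; treat them as an assumption and paste in whatever JSON your own View credentials panel shows.

```python
# Placeholder values only: replace the whole dictionary with the JSON shown
# under "View credentials". The key names below are assumed to match the usual
# layout of HMAC-enabled credentials and may differ for your instance.
credentials = {
    "apikey": "your-api-key",
    "cos_hmac_keys": {
        "access_key_id": "your-access-key-id",
        "secret_access_key": "your-secret-access-key",
    },
    "endpoints": "your-endpoints-url",
    "iam_apikey_description": "your-iam-apikey-description",
    "iam_apikey_name": "your-iam-apikey-name",
    "iam_role_crn": "your-iam-role-crn",
    "iam_serviceid_crn": "your-iam-serviceid-crn",
    "resource_instance_id": "your-resource-instance-id",
}

# The two HMAC values needed later for Cyberduck.
access_key_id = credentials["cos_hmac_keys"]["access_key_id"]
secret_access_key = credentials["cos_hmac_keys"]["secret_access_key"]
```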

Note: If you’re following along within a notebook be careful not to share this notebook after adding your credentials!

 

Putting Objects in Buckets

There are many ways to add objects to your bucket, but we’ll start by taking a look at two simple ways: the IBM Cloud web interface and Cyberduck.

IBM Cloud Web Interface

You can add a CSV file of your choice to your newly created bucket through the web interface by either clicking the Add objects button, or dragging and dropping your CSV file into the IBM Cloud window.

If you don’t have an interesting CSV file handy, I recommend downloading FiveThirtyEight’s 2018 World Cup predictions.

Cyberduck

Cyberduck is a free cloud storage browser for Mac OS and Windows. It allows you to easily manage all of the files in all of your object storage instances. After downloading, installing, and starting Cyberduck, create a new bookmark by pressing ⌘+Shift+B on Mac OS or Ctrl+Shift+B on Windows. A window will pop up with some bookmark configuration options. Select the Amazon S3 option from the dropdown and fill in the form as follows:

  • Nickname: enter anything you like.
  • Server: enter your service endpoint. You can choose any public endpoint here. For your convenience, I recommend one of these:
    • s3-api.us-geo.objectstorage.softlayer.net (if you live in the Americas)
    • s3.eu-geo.objectstorage.softlayer.net (if you live in Europe)
    • s3.ap-geo.objectstorage.softlayer.net (if you live in Asia)
  • Access Key ID: enter the access_key_id you created above in the Creating Service Credentials section.

Close the window and double-click on your newly created bookmark. You will be asked to log in. Enter the secret_access_key you created above in the Creating Service Credentials section and click Login.

You should now see a file browser pane with the bucket you created in the Working with Buckets section. If you added a file in the previous step, you should also be able to expand your bucket to view the file. Further actions on a file are available from the action dropdown or the context menu (right-click on Windows, control-click on Mac OS).

You can add files to your buckets by dragging and dropping them onto this window.

Whether you use the IBM Cloud web interface or Cyberduck, assign the name of the CSV file you upload to the variable filename below so that you can easily refer to it later.
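For example, the assignment might look like this. The file name shown is hypothetical; use the exact name (key) of the object as it appears in your bucket.

```python
# Hypothetical file name; replace it with the name of the CSV file you uploaded.
filename = 'my-file.csv'
```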

 

Reading CSV files from Object Storage with Cyberduck

Once you have successfully accessed an object storage instance in Cyberduck using the above steps, you can download files by double-clicking them in Cyberduck’s file browser. You can also generate links to your files by selecting the Open/Copy Link URL option.

By default your files are not publicly accessible, so selecting a URL that is not pre-signed will not allow the file to be downloaded. Pre-signed URLs do allow for files to be downloaded, but the link will eventually expire. If you want a permanently available public link to one of your files, you can select the Info option for that file and add READ permissions for Everyone under the permissions section.

 

After changing this setting you can share the URL (without pre-signing) and anyone with the link will be able to download the file, either by opening the link in their web browser or by using a tool like wget from within a Jupyter notebook.
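If you would rather stay in Python than shell out to wget, the sketch below does the equivalent with the standard library. The helper names are made up for this example, and the URL builder assumes a path-style address of the form https://endpoint/bucket/key; the safest option is to paste the exact link Cyberduck copies for you.

```python
import urllib.request
from urllib.parse import quote


def public_object_url(endpoint, bucket_name, key):
    """Build a path-style URL (https://<endpoint>/<bucket>/<key>) for an object.

    Assumption: the object is served at a path-style address on the endpoint.
    """
    return 'https://{}/{}/{}'.format(endpoint, quote(bucket_name), quote(key))


def download_public_object(url, local_path):
    """Download a publicly readable object to local_path, much like wget would."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


# Example usage (needs network access and an object with public READ permission):
# url = public_object_url('s3-api.us-geo.objectstorage.softlayer.net',
#                         'cc-tutorial', 'my-file.csv')
# download_public_object(url, 'my-file.csv')
```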

Reading CSV files from Object Storage using Python

The recommended way to access IBM Cloud Object Storage with Python is to use the ibm_boto3 library, which we’ll import below.

The primary way to interact with IBM Cloud Object Storage through ibm_boto3 is by using an ibm_boto3.resource object. This resource-based interface abstracts away the low-level REST interface between you and your Object Storage instance.

Run the cell below to create a resource Python object using the IBM Cloud Object Storage credentials you filled in above.
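A minimal sketch of that step, assuming the credentials dictionary contains a cos_hmac_keys entry as HMAC-enabled service credentials normally do. The function name create_resource and the endpoint_url parameter are illustrative choices made here, and the SDK is assumed to be installed with pip install ibm-cos-sdk.

```python
def create_resource(credentials, endpoint_url):
    """Create an ibm_boto3 resource from HMAC service credentials.

    credentials: dict with a 'cos_hmac_keys' entry (assumed layout).
    endpoint_url: full URL of a public endpoint, including 'https://'.
    """
    # Imported inside the function so this snippet can be loaded even on a
    # machine where the SDK is not installed yet.
    import ibm_boto3

    hmac_keys = credentials['cos_hmac_keys']
    return ibm_boto3.resource(
        's3',
        aws_access_key_id=hmac_keys['access_key_id'],
        aws_secret_access_key=hmac_keys['secret_access_key'],
        endpoint_url=endpoint_url,
    )


# Example usage with one of the public endpoints listed earlier:
# resource = create_resource(
#     credentials, 'https://s3-api.us-geo.objectstorage.softlayer.net')
```

Passing the HMAC keys explicitly keeps the example independent of any credential files on your machine.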

After creating a resource object, we can easily access any of our Cloud objects by specifying a bucket name and a key (in our case the key is a filename) to our resource.Object method and calling the get method on the result. In order to get the object into a useful format, we’ll do some processing to turn it into a pandas dataframe.

 

We’ll make this into a function so we can easily use it later:
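One way such a function could look is sketched below. The names read_object_bytes and get_dataframe are assumptions made for this example, and resource is expected to be the ibm_boto3 resource object described above.

```python
import io


def read_object_bytes(resource, bucket_name, key):
    """Return the raw bytes of the object stored under key in bucket_name."""
    response = resource.Object(bucket_name, key).get()
    # The 'Body' entry is a stream; read() returns its full contents as bytes.
    return response['Body'].read()


def get_dataframe(resource, bucket_name, key):
    """Read a CSV object from Object Storage into a pandas dataframe."""
    # Imported here so read_object_bytes stays usable without pandas installed.
    import pandas as pd

    data = read_object_bytes(resource, bucket_name, key)
    return pd.read_csv(io.BytesIO(data))


# Example usage, assuming the variables defined earlier in the tutorial:
# df = get_dataframe(resource, bucket_name, filename)
# df.head()
```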

Adding files to IBM Cloud Object Storage with Python

IBM Cloud Object Storage’s web interface makes it easy to add new objects to your buckets, but at some point you will probably want to handle creating objects through Python programmatically. The put_object method allows you to do this.

In order to use it you will need:

  1. The name of the bucket you want to add the object to;
  2. A unique name (Key) for the new object;
  3. The contents of the object, either as bytes or as a file opened in binary mode, which you can get from:
    • urllib’s request.urlopen(...).read() method, e.g.
      urllib.request.urlopen('https://example.com/my-csv-file.csv').read()
    • Python’s built-in open function in binary mode, e.g.
      open('myfile.csv', 'rb')

To demonstrate, let’s add another CSV file to our bucket. This time we’ll use FiveThirtyEight’s airline safety dataset.
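The sketch below shows roughly how this could be done with put_object, called on a bucket obtained from the resource. The helper names are illustrative, and the dataset URL in the usage comment is an assumption about where FiveThirtyEight publishes the file, so check it before relying on it.

```python
import urllib.request


def upload_bytes(resource, bucket_name, key, data):
    """Store data (bytes or a binary file object) under key in bucket_name."""
    return resource.Bucket(bucket_name).put_object(Key=key, Body=data)


def upload_from_url(resource, bucket_name, key, url):
    """Download the file at url and store its contents under key."""
    data = urllib.request.urlopen(url).read()
    return upload_bytes(resource, bucket_name, key, data)


# Example usage (needs network access and the resource created earlier).
# The URL is assumed, not taken from this article:
# upload_from_url(
#     resource, bucket_name, 'airline-safety.csv',
#     'https://raw.githubusercontent.com/fivethirtyeight/data/master/'
#     'airline-safety/airline-safety.csv')
```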

You can now easily access your newly created object using the function we defined above in the Reading CSV files from Object Storage using Python section.

Get 10 Terabytes of IBM Cloud Object Storage for free

You now know how to read from and write to IBM Cloud Object Storage using Python! Well done. The ability to programmatically read and write files to the Cloud will be quite handy when working from scripts and Jupyter notebooks.

If you build applications or do data science, we also have a great offer for you. You can apply to become an IBM Partner at no cost to you and receive 10 Terabytes of space to play and build applications with.

You can sign up by simply filling out the embedded form below. If you are unable to fill out the form, you can click here to open it in a new window.

Just make sure that you apply with a business email (even your own domain name if you are a freelancer) as free email accounts like Gmail, Hotmail, and Yahoo are automatically rejected.