[VDS] 7 Tips I Would Have Given Me In 2023
By Mathieu Guglielmino, Jan 1 2024
A digest more focused on code and software, for this first day of 2024.
1. Work in a Virtual Environment
A virtual environment serves as an isolated instance of Python. You can have as many as you want on your machine; each one is essentially a folder of libraries inside your project directory.
It’s handy for keeping track of what a project needs, and it’s a good idea to create one whenever you start something new.
$ pip install virtualenv # or sudo apt-get install python3-virtualenv
$ python -m virtualenv venv # create a virtual environment called "venv" in the current directory
$ source venv/bin/activate # activate the environment
Once the virtual environment is sourced, your terminal should display (venv) before the commands you type:
(venv) $ python --version # the Python version of the virtual env
(venv) $ pip freeze # display the Python packages installed in your virtual environment
(venv) $ pip freeze > requirements.txt # write the packages with version of your venv in requirements.txt
This requirements.txt makes it easy to share the dependencies of your code without sending out the Python libs. You can install all the packages from the requirements.txt file in a fresh virtual environment with the -r flag of pip install:
(venv) $ pip install -r requirements.txt
For notebook lovers: recall that your Python code runs in what is called a “kernel”. You may want the kernel’s Python and your virtual env’s Python to be the same one:
(venv) $ pip install ipykernel
(venv) $ python -m ipykernel install --user --name=my_project
Run the ipykernel install command from inside your virtual env to make it available in Jupyter as the my_project kernel.
2. Check Your Column Names
Exploratory data analysis is an emergent activity, so it’s hard to know in advance what you will find.
However, I always check my data’s columns at a few checkpoints:
- after I load the data (data_loaded),
- after processing (data_proc),
- before I create charts (data_charts),
- before I write the data somewhere (data_output).
In Python this can be done with the assert keyword. Assuming your data is a pandas dataframe:
assert list(df_loaded.columns) == ['id', 'category', 'valueA', 'valueB']
If the comparison is false, an AssertionError is raised. (The list() call matters: comparing df.columns directly with == is element-wise in pandas, so the result can’t be used as a single boolean.)
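Since the same check happens at each checkpoint, I find a small helper handy. A minimal sketch (the check_columns name and the example column list are mine):
import pandas as pd

def check_columns(df: pd.DataFrame, expected: list[str]) -> pd.DataFrame:
    """Assert that df has exactly the expected columns, then return df unchanged."""
    assert list(df.columns) == expected, (
        f"Expected columns {expected}, got {list(df.columns)}"
    )
    return df

# At each checkpoint:
# df_loaded = check_columns(df_loaded, ['id', 'category', 'valueA', 'valueB'])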
3. Will 2024 Be The Year Of Tests?
Tests make you write cleaner and more modular code, which improves readability and maintainability, and they provide a common understanding of how your code should behave.
If you want to write tests, in Python you have the choice:
- unittest or pytest,
- doctest, which searches for pieces of text that look like interactive Python sessions, and then executes those sessions to verify that they work exactly as shown.
import unittest

class TestStringMethods(unittest.TestCase):

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()
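And here is what the doctest flavor looks like, a minimal sketch where the docstring itself is the test (runnable with python -m doctest -v yourfile.py):
def split_words(s):
    """Split a sentence into words.

    >>> split_words('hello world')
    ['hello', 'world']
    """
    return s.split()

if __name__ == '__main__':
    import doctest
    doctest.testmod()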
4. Print Is Good; Log Is Better
Logs are at the center of any monitoring setup. For data manipulation, you may want to record somewhere, like in a file, the steps of your pipeline and any errors along the way.
Handlers such as logging.FileHandler or logging.StreamHandler are useful to output the logs to the stream and/or to a file:
import datetime
import logging
from importlib import reload

class CustomLogger:
    def __init__(self,
                 log_to_file=True,
                 log_to_stream=True,
                 name: str = None):
        self.log_to_file = log_to_file
        self.log_to_stream = log_to_stream
        self.name = name
        self.logger = self._configure_logger()

    def _configure_logger(self):
        # Reload previous instances of logger
        logging.shutdown()
        reload(logging)
        # Configure the custom logger
        if self.name is None:
            self.name = "log"
        log_filename = f"{self.name}-{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}.log"
        logger = logging.getLogger(__name__)
        logger.setLevel(logging.INFO)
        # Create handlers to send log messages to a file and/or the terminal
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        if self.log_to_file:
            file_handler = logging.FileHandler(log_filename)
            file_handler.setFormatter(formatter)
            logger.addHandler(file_handler)
        if self.log_to_stream:
            stream_handler = logging.StreamHandler()
            stream_handler.setFormatter(formatter)
            logger.addHandler(stream_handler)
        return logger

    def get_logger(self):
        return self.logger
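For the “errors” part, logger.exception is the standard-library way to record a message plus the full traceback. A minimal sketch using the class above (the name and message are just illustrations):
logger = CustomLogger(log_to_file=True, log_to_stream=True, name="pipeline").get_logger()

try:
    1 / 0  # stand-in for a failing pipeline step
except ZeroDivisionError:
    # logs the message at ERROR level, followed by the traceback
    logger.exception("Pipeline step failed")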
5. What Does This Function Do? Use Docstrings And Types
There are no excuses not to write docstrings in 2024.
Documentation is seamlessly integrated in most code editors, such as VS Code, and it’s never been easier to ask ChatGPT to write the docstring of a function.
This is the docstring of the previous CustomLogger class, with clear formatting and a usage example:
class CustomLogger:
    def __init__(self,
                 log_to_file=True,
                 log_to_stream=True,
                 name: str = None):
        """
        Initialize a custom logger.

        This class configures a custom logger with log messages both in a log file and the terminal. It uses the current timestamp to create a unique log file.

        Args
        ---
        - log_to_file (bool, optional): If True, log messages will be written to a log file.
        - log_to_stream (bool, optional): If True, log messages will be streamed to the terminal.
        - name (str, optional): Base name of the log file. Defaults to "log".

        Example
        ---
        >>> logger = CustomLogger(log_to_file=True, log_to_stream=True).get_logger()
        >>> logger.info("This is an example log message.")
        """
6. Define Your Input Parameters In YAML Files
You shouldn’t bury the parameters of your pipeline directly in the code, but rather write self-explanatory configuration files, lightweight and easy to share.
While JSON is a commonly used file format, it comes with certain drawbacks that make it less ideal compared to YAML:
- Human-Readable Syntax: YAML uses indentation and relies on whitespace for structure, making it more readable and visually appealing to humans;
- Minimal Punctuation: YAML requires fewer punctuation characters compared to JSON;
- Comments Support: YAML supports comments, allowing explanatory notes directly in the configuration file.
# Data Science Configuration
project_info:
  name: "DataAnalysisProject"
  description: "Exploratory data analysis and modeling"

data_sources:
  name: "input_data"
  path: "data/input_data.csv"
  columns:
    - id
    - category
    - value

output:
  results_folder: "output/results.csv"
I usually have a Config class that I adapt for each project:
import yaml

class Config:
    """Configuration Python object. Useful to parse YAML.

    >>> conf_path = 'config.yaml'
    >>> conf = Config.from_path(conf_path)
    >>> conf.get('project_info').get('name')
    'DataAnalysisProject'
    """

    def __init__(self, config_data):
        self._config_data = config_data

    def get_config_data(self):
        return self._config_data

    def get(self, key, default=None):
        return self._config_data.get(key, default)

    @classmethod
    def from_path(cls, file_path):
        try:
            with open(file_path, 'r') as file:
                config = yaml.safe_load(file)
        except FileNotFoundError as e:
            raise FileNotFoundError(f"Error: Configuration file '{file_path}' not found.") from e
        except yaml.YAMLError as e:
            raise yaml.YAMLError(f"Error parsing YAML configuration file '{file_path}': {e}") from e
        return cls(config)
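The default argument of get is what makes this class pleasant: a missing key degrades gracefully instead of raising. For example, with the config.yaml above:
conf = Config.from_path("config.yaml")

# Nested access with fallbacks if a key is absent
results_folder = conf.get("output", {}).get("results_folder", "output/")
columns = conf.get("data_sources", {}).get("columns", [])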
7. Maintain A Design System to Look Beyond Default
Design systems document the looks of what you build.
Nothing looks more by-default than the default colors of matplotlib.
If you use Python, you can keep a constants.py and define a color palette, or a dictionary of colors for some ominous business categories you chart all the time:
palette = [
    '#E70099',  # pink (accent)
    '#EB8D00',  # yellow
    '#00AEA4',  # turquoise
    ...
]

discrete_color_map = {
    'OIL': '#000918',
    'RENEWABLES': '#49AA56',
}
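To make your palette the default rather than matplotlib’s, you can register it as the color cycle. A minimal sketch, assuming the palette above lives in constants.py with the ... filled in:
import matplotlib.pyplot as plt
from cycler import cycler  # shipped with matplotlib

from constants import palette, discrete_color_map

# Every subsequent plot picks its colors from the palette, in order
plt.rcParams["axes.prop_cycle"] = cycler(color=palette)

# For categorical data, look colors up explicitly:
# bar_colors = [discrete_color_map[c] for c in df["category"]]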
If you’re not familiar with design systems, you can take a look at The Economist Visual Styleguide.
2023 Reviews
You can also check the following visual reviews of 2023:
- 2023 The Year in Graphics (Bloomberg)
- Axios Visuals 2023 in review
- The Pudding Cup
- 2023: The Year in Visual Stories and Graphics (NYTimes)
- FigData 2023 (Le Figaro on X)
- DataWrapper Vis Dispatch
- 2023 The Year In Graphics (WSJ)
- 2023 USA Today
- Best of Data and Graphics: 2023 (LA Times)
See you next week,
Mathieu Guglielmino