[VDS] 7 Tips I Would Have Given Me In 2023
By Mathieu Guglielmino, Jan 1 2024
A digest more focused on code and software, for this first day of 2024.
1. Work in a Virtual Environment
A virtual environment serves as an isolated instance of Python. You can have as many as you want on your machine; each one is essentially a folder of libraries inside your project directory.
It’s handy for keeping track of what a project needs, and it’s a good idea to create one whenever you start something new.
$ pip install virtualenv # or sudo apt-get install python3-virtualenv
$ python -m virtualenv venv # create a virtual environment called "venv" in the current directory
$ source venv/bin/activate # activate the environment
Once the virtual environment is sourced, your terminal should display (venv) before the commands you type:
(venv) $ python --version # the Python version of the virtual env
(venv) $ pip freeze # display the Python packages installed in your virtual environment
(venv) $ pip freeze > requirements.txt # write the packages with version of your venv in requirements.txt
This requirements.txt makes it easy to share the dependencies of your code without sending out the Python libs. You can install all the packages from the requirements.txt file in a fresh virtual environment with the -r flag of pip install:
(venv) $ pip install -r requirements.txt
For notebook lovers: recall that your Python code runs in what is called a “kernel”. You may want the kernel’s Python and your virtual env’s Python to be the same one:
(venv) $ pip install ipykernel
(venv) $ python -m ipykernel install --user --name=my_project
Run the ipykernel install command from inside your virtual env to make it available in Jupyter as the my_project kernel.
2. Check Your Column Names
Exploratory data analysis is an emergent activity, so it’s hard to know in advance what you will find.
However, I always check my data’s columns at a few checkpoints:
- after I load the data (data_loaded),
- after processing (data_proc),
- before I create charts (data_charts),
- before I write the data somewhere (data_output).
In Python this can be done with the assert keyword. Assuming your data is a pandas dataframe:
assert list(df_loaded.columns) == ['id', 'category', 'valueA', 'valueB']
If the comparison is false, an AssertionError is raised. (The list() call matters: comparing df.columns directly with == is element-wise in pandas, so the result can’t be used as a single boolean.)
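Since the same check happens at each checkpoint, I find a small helper handy. A minimal sketch (the check_columns name and the example column list are mine):
import pandas as pd

def check_columns(df: pd.DataFrame, expected: list[str]) -> pd.DataFrame:
    """Assert that df has exactly the expected columns, then return df unchanged."""
    assert list(df.columns) == expected, (
        f"Expected columns {expected}, got {list(df.columns)}"
    )
    return df

# At each checkpoint:
# df_loaded = check_columns(df_loaded, ['id', 'category', 'valueA', 'valueB'])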
3. Will 2024 Be The Year Of Tests?
Tests make you write cleaner and more modular code, which improves readability and maintainability, and they provide a common understanding of how your code should behave.
If you want to write tests, in Python you have the choice:
- unittest or pytest,
- doctest, which searches for pieces of text that look like interactive Python sessions, and then executes those sessions to verify that they work exactly as shown.
import unittest

class TestStringMethods(unittest.TestCase):

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()
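And here is what the doctest flavor looks like, a minimal sketch where the docstring itself is the test (runnable with python -m doctest -v yourfile.py):
def split_words(s):
    """Split a sentence into words.

    >>> split_words('hello world')
    ['hello', 'world']
    """
    return s.split()

if __name__ == '__main__':
    import doctest
    doctest.testmod()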
4. Print Is Good; Log Is Better
Logs are at the center of any monitoring setup. For data manipulation, you may want to record somewhere, like in a file, the steps of your pipeline and any errors along the way.
Handlers such as logging.FileHandler or logging.StreamHandler are useful to output the logs to the stream and/or to a file:
import datetime
import logging
from importlib import reload

class CustomLogger:
    def __init__(self,
                 log_to_file=True,
                 log_to_stream=True,
                 name: str = None):
        self.log_to_file = log_to_file
        self.log_to_stream = log_to_stream
        self.name = name
        self.logger = self._configure_logger()

    def _configure_logger(self):
        # Reload previous instances of logger
        logging.shutdown()
        reload(logging)
        # Configure the custom logger
        if self.name is None:
            self.name = "log"
        log_filename = f"{self.name}-{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}.log"
        logger = logging.getLogger(__name__)
        logger.setLevel(logging.INFO)
        # Create handlers to send log messages to a file and/or the terminal
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        if self.log_to_file:
            file_handler = logging.FileHandler(log_filename)
            file_handler.setFormatter(formatter)
            logger.addHandler(file_handler)
        if self.log_to_stream:
            stream_handler = logging.StreamHandler()
            stream_handler.setFormatter(formatter)
            logger.addHandler(stream_handler)
        return logger

    def get_logger(self):
        return self.logger
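For the “errors” part, logger.exception is the standard-library way to record a message plus the full traceback. A minimal sketch using the class above (the name and message are just illustrations):
logger = CustomLogger(log_to_file=True, log_to_stream=True, name="pipeline").get_logger()

try:
    1 / 0  # stand-in for a failing pipeline step
except ZeroDivisionError:
    # logs the message at ERROR level, followed by the traceback
    logger.exception("Pipeline step failed")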
5. What Does This Function Do? Use Docstrings And Types
There are no excuses not to write docstrings in 2024.
Documentation is seamlessly integrated in most code editors, such as VS Code, and it’s never been easier to ask ChatGPT to write the docstring of a function.
This is the docstring of the previous CustomLogger class, with clear formatting and a usage example:
class CustomLogger:
    def __init__(self,
                 log_to_file=True,
                 log_to_stream=True,
                 name: str = None):
        """
        Initialize a custom logger.

        This class configures a custom logger with log messages both in a log file and the terminal. It uses the current timestamp to create a unique log file.

        Args
        ---
        - log_to_file (bool, optional): If True, log messages will be written to a log file.
        - log_to_stream (bool, optional): If True, log messages will be streamed to the terminal.
        - name (str, optional): Base name of the log file. Defaults to "log".

        Example
        ---
        >>> logger = CustomLogger(log_to_file=True, log_to_stream=True).get_logger()
        >>> logger.info("This is an example log message.")
        """
6. Define Your Input Parameters In YAML Files
You shouldn’t bury the parameters of your pipeline directly in the code, but rather write self-explanatory configuration files, lightweight and easy to share.
While JSON is a commonly used file format, it comes with certain drawbacks that make it less ideal compared to YAML:
- Human-Readable Syntax: YAML uses indentation and relies on whitespace for structure, making it more readable and visually appealing to humans;
- Minimal Punctuation: YAML requires fewer punctuation characters compared to JSON;
- Comments Support: YAML supports comments, allowing explanatory notes directly in the configuration file.
# Data Science Configuration
project_info:
  name: "DataAnalysisProject"
  description: "Exploratory data analysis and modeling"

data_sources:
  name: "input_data"
  path: "data/input_data.csv"
  columns:
    - id
    - category
    - value

output:
  results_folder: "output/results.csv"
I usually have a Config class that I adapt for each project:
import yaml

class Config:
    """Configuration Python object. Useful to parse YAML.

    >>> conf_path = 'config.yaml'
    >>> conf = Config.from_path(conf_path)
    >>> conf.get('project_info').get('name')
    'DataAnalysisProject'
    """

    def __init__(self, config_data):
        self._config_data = config_data

    def get_config_data(self):
        return self._config_data

    def get(self, key, default=None):
        return self._config_data.get(key, default)

    @classmethod
    def from_path(cls, file_path):
        try:
            with open(file_path, 'r') as file:
                config = yaml.safe_load(file)
        except FileNotFoundError as e:
            raise FileNotFoundError(f"Error: Configuration file '{file_path}' not found.") from e
        except yaml.YAMLError as e:
            raise yaml.YAMLError(f"Error parsing YAML configuration file '{file_path}': {e}") from e
        return cls(config)
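The default argument of get is what makes this class pleasant: a missing key degrades gracefully instead of raising. For example, with the config.yaml above:
conf = Config.from_path("config.yaml")

# Nested access with fallbacks if a key is absent
results_folder = conf.get("output", {}).get("results_folder", "output/")
columns = conf.get("data_sources", {}).get("columns", [])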
7. Maintain A Design System to Look Beyond Default
Design systems document the looks of what you build.
Nothing looks more by-default than the default colors of matplotlib.
If you use Python, you can keep a constants.py and define a color palette, or a dictionary of colors for some ominous business categories you chart all the time:
palette = [
    '#E70099',  # pink (accent)
    '#EB8D00',  # yellow
    '#00AEA4',  # turquoise
    ...
]

discrete_color_map = {
    'OIL': '#000918',
    'RENEWABLES': '#49AA56',
}
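To make your palette the default rather than matplotlib’s, you can register it as the color cycle. A minimal sketch, assuming the palette above lives in constants.py with the ... filled in:
import matplotlib.pyplot as plt
from cycler import cycler  # shipped with matplotlib

from constants import palette, discrete_color_map

# Every subsequent plot picks its colors from the palette, in order
plt.rcParams["axes.prop_cycle"] = cycler(color=palette)

# For categorical data, look colors up explicitly:
# bar_colors = [discrete_color_map[c] for c in df["category"]]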
If you’re not familiar with design systems, you can take a look at The Economist Visual Styleguide.
2023 Reviews
You can also check the following visual reviews of 2023:
- 2023 The Year in Graphics (Bloomberg)
- Axios Visuals 2023 in review
- The Pudding Cup
- 2023: The Year in Visual Stories and Graphics (NYTimes)
- FigData 2023 (Le Figaro on X)
- DataWrapper Vis Dispatch
- 2023 The Year In Graphics (WSJ)
- 2023 USA Today
- Best of Data and Graphics: 2023 (LA Times)
See you next week,
Mathieu Guglielmino