Thread, Multi-Threading and Multi-Processing

A thread is a unit of execution within a process, and it’s primarily related to CPU processing rather than RAM, though it does use memory. The CPU does not store data for long durations. It processes data and instructions but relies on caches (small, fast memory located on or near the CPU) and registers (tiny storage locations within the CPU) for temporary, high-speed data handling. The actual data storage, such as stack and heap, is located in the larger RAM.

A thread is the smallest unit of processing that can be scheduled by an operating system. It is like a lightweight process that lives inside a process. Multiple threads can exist within the same process, sharing the same resources.

Asynchronous versus synchronous: multi-threading is one way to achieve asynchronous execution.

For heavy I/O-bound tasks such as reads and downloads, which do not require the CPU to compute, multi-threading is ideal. It does, however, create lots of threads, which is an overhead.

It’s worth noting the difference between multi-threading and multi-processing: both look concurrent, but they serve different purposes. Multi-threading (asynchronous) still handles tasks one by one, but it does not sit idle waiting for each task to complete before starting the next; multi-processing runs tasks truly in parallel, which suits CPU-intensive work.

The slow way is sequential execution:
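A minimal sketch of the sequential approach, where `time.sleep` stands in for an I/O wait (the `fake_download` helper and the URL names are made up for illustration):

```python
import time

def fake_download(url):
    # Stand-in for a network request: each "download" blocks for 1 second
    time.sleep(1)
    return f"data from {url}"

urls = ["site-a", "site-b", "site-c"]

start = time.time()
results = [fake_download(u) for u in urls]  # One at a time: ~3 seconds total
elapsed = time.time() - start
print(f"Sequential: {elapsed:.1f}s, {len(results)} results")
```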

The asynchronous/multi-threading way:
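The same simulated downloads with threads (again a sketch; `fake_download` is a hypothetical helper). All three sleeps overlap, so the total is roughly one second instead of three:

```python
import time
from threading import Thread

def fake_download(url, results):
    time.sleep(1)  # Stand-in for waiting on I/O
    results.append(f"data from {url}")

urls = ["site-a", "site-b", "site-c"]
results = []

start = time.time()
threads = [Thread(target=fake_download, args=(u, results)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()  # Wait for all threads to finish
elapsed = time.time() - start
print(f"Threaded: {elapsed:.1f}s, {len(results)} results")  # ~1s, not ~3s
```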

The concurrent/multi-processing way:
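For CPU-bound work the threaded version would not help (see the GIL discussion below), so multi-processing is used instead. A sketch with a made-up `burn_cpu` function; each process runs on its own core:

```python
import time
from multiprocessing import Pool

def burn_cpu(n):
    # CPU-bound work: pure computation, no waiting
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    start = time.time()
    with Pool(processes=4) as pool:
        # The four tasks run in four separate processes at the same time
        results = pool.map(burn_cpu, [2_000_000] * 4)
    print(f"Multi-processing: {time.time() - start:.1f}s")
```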

Threads are executed by CPU cores, and CPU cores are made up of transistors in integrated circuits.

Here is an example of creating a thread in C++:

// C++ example of thread creation
#include <thread>

void threadFunction() {
    // This code runs in a separate thread
}

int main() {
    // Create a new thread
    std::thread t1(threadFunction);
    
    // Wait for thread to finish
    t1.join();
    
    return 0;
}
CPU Die
├── Core 1
│   ├── ALU (Arithmetic Logic Unit)
│   │   └── Thousands/Millions of Transistors
│   ├── Control Unit
│   │   └── Thousands/Millions of Transistors
│   ├── Cache Memory
│   │   └── Millions of Transistors
│   └── Registers
│       └── Hundreds/Thousands of Transistors
└── [Other Cores...]

How do transistors form logic?

Basic logic gates:
AND Gate (using transistors)
     ┌─────┐
A ────┤     │
      │ AND ├────Output
B ────┤     │
     └─────┘
Transistors combine to form more complex circuits:
Logic gates (AND, OR, NOT, etc.)
Flip-flops (memory elements)
Arithmetic circuits
Control circuits
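As a rough sketch, a transistor can be modeled as a switch, and gates built on top of it: a NAND gate is two such switches in series, and the other gates follow from NAND (this models only the logic, not the electronics; all function names are made up):

```python
def nand(a, b):
    # Two series "transistors": output goes low only when both inputs are on
    return not (a and b)

def not_gate(a):
    # A NAND with both inputs tied together acts as an inverter
    return nand(a, a)

def and_gate(a, b):
    # AND = NAND followed by NOT
    return not_gate(nand(a, b))

def or_gate(a, b):
    # De Morgan: OR(a, b) = NAND(NOT a, NOT b)
    return nand(not_gate(a), not_gate(b))

# Print the truth table for AND and OR
for a in (False, True):
    for b in (False, True):
        print(a, b, and_gate(a, b), or_gate(a, b))
```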

When to use multi-threading in Python? For I/O-bound operations (file operations, network calls). Use ThreadPoolExecutor for managing thread pools. A special thing about Python is the GIL (Global Interpreter Lock): it prevents multiple native threads from executing Python bytecode simultaneously, so threading is best for I/O-bound rather than CPU-bound tasks.

As for how to use Python threading, here are some examples.

1. Thread synchronization with lock
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100000):
        with lock:  # Thread-safe section
            counter += 1

# Create threads
threads = []
for _ in range(5):
    thread = threading.Thread(target=increment)
    threads.append(thread)
    thread.start()

# Wait for all threads
for thread in threads:
    thread.join()

print(f"Final counter value: {counter}")
2. Thread pool executor

from concurrent.futures import ThreadPoolExecutor
import time

def process_item(item):
    time.sleep(1)  # Simulate processing
    return item * item

# Using ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=3) as executor:
    numbers = [1, 2, 3, 4, 5]
    results = list(executor.map(process_item, numbers))

print(f"Results: {results}")

Looking deeper into the ThreadPoolExecutor: under the hood it first creates 3 worker threads; these workers wait in a pool for tasks.

Tasks:      [1, 2, 3, 4, 5]
Workers:    [Thread1, Thread2, Thread3]

Execution Flow:
Time 0s:   Thread1 → 1    Thread2 → 2    Thread3 → 3    (4,5 waiting)
Time 1s:   Thread1 → 4    Thread2 → 5    Thread3 done   (all tasks assigned)
Time 2s:   All done
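This hand-off can also be observed with submit() and as_completed(), which yields futures in the order tasks finish rather than the order they were submitted (a sketch reusing the squaring task from above):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_item(item):
    time.sleep(1)  # Simulate processing
    return item * item

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each future back to the number it was submitted with
    futures = {executor.submit(process_item, n): n for n in [1, 2, 3, 4, 5]}
    for future in as_completed(futures):
        print(f"{futures[future]} -> {future.result()}")
```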

Let’s see another example using ProcessPoolExecutor:

from concurrent.futures import ProcessPoolExecutor
import time

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def find_primes(number):
    return (number, is_prime(number))

def main():
    numbers = range(100000, 100100)  # Check numbers in this range
    
    # Sequential execution for comparison
    start_time = time.time()
    sequential_results = list(map(find_primes, numbers))
    sequential_time = time.time() - start_time
    
    # Parallel execution with ProcessPoolExecutor
    start_time = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:  # Use 4 processes
        parallel_results = list(executor.map(find_primes, numbers))
    parallel_time = time.time() - start_time
    
    # Print results
    print(f"Sequential time: {sequential_time:.2f} seconds")
    print(f"Parallel time: {parallel_time:.2f} seconds")
    print("\nPrime numbers found:")
    for number, is_prime_number in parallel_results:
        if is_prime_number:
            print(number, end=" ")

if __name__ == '__main__':
    main()

For CPU-intensive tasks, consider using multiprocessing instead of threading due to the GIL limitation. What are examples of CPU-intensive tasks?

Mathematical computations and image processing are typical examples. CPU-bound versus I/O-bound can be illustrated with this code comparison:

# CPU Bound Example
def cpu_bound():
    result = 0
    for i in range(100_000_000):
        result += i
    return result

# I/O Bound Example
def io_bound():
    with open('large_file.txt', 'r') as f:
        data = f.read()  # Waiting for disk read
    return len(data)

Multi-processing code looks like this:

from multiprocessing import Pool

def cpu_intensive_task(n):
    return sum(i * i for i in range(n))

# Good - Using multiprocessing
if __name__ == '__main__':
    numbers = [10000000] * 4
    with Pool(processes=4) as pool:
        results = pool.map(cpu_intensive_task, numbers)

Note that heavy mathematical computation such as matrix multiplication is a classic bottleneck that can be improved by multi/parallel processing. However, even after parallelizing, pure-Python code is still far less optimal than NumPy’s dot function, because NumPy optimizes matrix multiplication through several key techniques: np.dot() (and np.matmul()) are implemented in C and delegate to optimized BLAS/LAPACK routines.
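A rough comparison sketch (sizes kept small so it runs quickly; the `matmul_pure` helper is made up, and the actual speedup depends on the BLAS build NumPy is linked against):

```python
import time
import numpy as np

def matmul_pure(A, B):
    # Naive triple loop over Python lists: O(n^3) interpreted operations
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

n = 100
A = np.random.rand(n, n)
B = np.random.rand(n, n)

start = time.time()
C_slow = matmul_pure(A.tolist(), B.tolist())
t_pure = time.time() - start

start = time.time()
C_fast = A @ B  # Dispatches to the optimized BLAS routine
t_numpy = time.time() - start

print(f"pure Python: {t_pure:.3f}s, numpy: {t_numpy:.5f}s")
```
Both paths compute the same product; only the implementation differs.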
