Generating a billion records in Cassandra

Generating a billion records in Cassandra can be accomplished in several ways, including scripts, data generation tools, and custom applications. Below are some approaches you can take to generate a large dataset for testing or benchmarking purposes.
Method 1: Using a Python Script with the Cassandra Driver
You can use a Python script to generate and insert a billion records into Cassandra. This method uses the cassandra-driver library to connect to your Cassandra cluster.
Prerequisites
1. Install the Cassandra Driver for Python:
Make sure you have the Cassandra driver installed. You can install it using pip:
pip install cassandra-driver
2. Set Up Your Cassandra Keyspace and Table:
Create a keyspace and a table in Cassandra where you will insert the records. The example below uses SimpleStrategy with a replication factor of 1, which is appropriate only for a single-node test cluster.
CREATE KEYSPACE test_keyspace WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 };

CREATE TABLE test_keyspace.test_table (
    id UUID PRIMARY KEY,
    name TEXT,
    age INT
);
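
If you prefer to create the schema from code instead of cqlsh, the same Python driver can execute DDL statements. A minimal sketch:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # Replace with your Cassandra node IP
session = cluster.connect()

# IF NOT EXISTS makes the script safe to re-run
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS test_keyspace "
    "WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute(
    "CREATE TABLE IF NOT EXISTS test_keyspace.test_table "
    "(id UUID PRIMARY KEY, name TEXT, age INT)"
)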

Python Script to Generate Records
Here’s a sample Python script that generates and inserts a billion records, one synchronous insert at a time (a faster concurrent variant follows the script):
from cassandra.cluster import Cluster
import uuid
import random

# Connect to Cassandra
cluster = Cluster(['127.0.0.1'])  # Replace with your Cassandra node IP
session = cluster.connect('test_keyspace')

# Prepare the insert statement
insert_stmt = session.prepare("INSERT INTO test_table (id, name, age) VALUES (?, ?, ?)")

# Generate and insert records
for i in range(1, 1000000001):  # 1 billion records
    record_id = uuid.uuid4()
    name = f"Name_{i}"
    age = random.randint(18, 99)

    session.execute(insert_stmt, (record_id, name, age))

    if i % 100000 == 0:  # Print progress every 100,000 records
        print(f"Inserted {i} records")

# Close the session and cluster connection
session.shutdown()
cluster.shutdown()
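
Note that one synchronous insert per row is very slow at this scale: even at 5,000 rows per second, a billion rows takes well over two days. A faster variant keeps many requests in flight using the driver's cassandra.concurrent helpers. A minimal sketch, where the CHUNK size and concurrency values are illustrative and should be tuned for your cluster:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
import uuid
import random

cluster = Cluster(['127.0.0.1'])  # Replace with your Cassandra node IP
session = cluster.connect('test_keyspace')
insert_stmt = session.prepare("INSERT INTO test_table (id, name, age) VALUES (?, ?, ?)")

CHUNK = 10000  # rows handed to the driver per call (illustrative value)
for start in range(0, 1000000000, CHUNK):
    params = [(uuid.uuid4(), f"Name_{start + j + 1}", random.randint(18, 99))
              for j in range(CHUNK)]
    # Keeps up to `concurrency` requests in flight at once
    execute_concurrent_with_args(session, insert_stmt, params, concurrency=100)
    if (start + CHUNK) % 1000000 == 0:  # Print progress every million records
        print(f"Inserted {start + CHUNK} records")

session.shutdown()
cluster.shutdown()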

Method 2: Using Apache Spark
If you have Apache Spark set up, you can use it to generate and insert a large number of records into Cassandra efficiently.
Prerequisites
1. Set Up Spark with the Cassandra Connector:
Make sure the Spark Cassandra Connector is available to your job, either by passing --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 to spark-submit or by including the following Maven dependency:

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.12</artifactId>
    <version>3.1.0</version>
</dependency>

Spark Job to Generate Records
Here’s a sample Spark job in Scala to generate and insert records:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GenerateRecords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Generate Records")
      .config("spark.cassandra.connection.host", "127.0.0.1") // Replace with your Cassandra node IP
      .getOrCreate()

    // Generate 1 billion rows lazily with spark.range so they are created in
    // parallel on the executors instead of being materialized on the driver
    // (a Scala collection of a billion tuples would exhaust driver memory).
    // uuid() is a built-in Spark SQL function; the connector converts the
    // resulting string to a CQL uuid on write.
    val df = spark.range(1, 1000000001).toDF("seq")
      .select(
        expr("uuid()").as("id"),
        concat(lit("Name_"), col("seq").cast("string")).as("name"),
        (floor(rand() * 82) + 18).cast("int").as("age") // uniform in [18, 99]
      )

    // Write to Cassandra
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test_keyspace", "table" -> "test_table"))
      .mode("append")
      .save()

    spark.stop()
  }
}
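
Write throughput can be tuned through the connector's output settings, such as spark.cassandra.output.concurrent.writes and spark.cassandra.output.batch.size.rows; the defaults are a reasonable starting point, and the right values depend on your cluster.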

Method 3: Using Data Generation Tools
You can also use data generation tools like:
- Apache JMeter: You can create a test plan to generate data and insert it into Cassandra.
- Mockaroo: A web-based tool that allows you to generate large datasets in various formats, including CSV, which you can then import into Cassandra, for example with cqlsh's COPY FROM command or programmatically as sketched below.
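
For the CSV route, here is a minimal Python sketch that loads a generated file into the test table with the same driver; the file name mock_data.csv and the id/name/age column layout are assumptions for illustration:

import csv
import uuid

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])  # Replace with your Cassandra node IP
session = cluster.connect('test_keyspace')
insert_stmt = session.prepare("INSERT INTO test_table (id, name, age) VALUES (?, ?, ?)")

# Reads the whole file into memory; for very large files, load in chunks
with open('mock_data.csv', newline='') as f:
    rows = [(uuid.UUID(r['id']), r['name'], int(r['age']))
            for r in csv.DictReader(f)]

execute_concurrent_with_args(session, insert_stmt, rows, concurrency=100)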
Conclusion
Generating a billion records in Cassandra can be done using various methods, including Python scripts, Apache Spark, or data generation tools. Choose the method that best fits your environment and requirements. Always ensure that your Cassandra cluster is properly configured to handle the load, and monitor performance during the data generation process.