CS50 Week 6: Problem Set, DNA

Programming/CS50 2023. 7. 21. 15:07

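This post walks through a solution to DNA, the Week 6 problem set from Harvard's CS50 course.
The task is to identify a person from the results of a DNA analysis.
The solution uses Python's pandas library to handle the CSV data.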

Task

$ python dna.py databases/large.csv sequences/5.txt
Lavender
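
To make the input concrete, here is a minimal sketch of how pandas loads a database of this shape. The three rows and the STR column names are hypothetical stand-ins, not the contents of `databases/large.csv`; the assumption is only that the file has a `name` column followed by one integer column per STR.

```python
import io

import pandas as pd

# Hypothetical miniature database: a "name" column, then one column per STR.
csv_text = """name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
"""

# index_col="name" makes each person's name the row label,
# so the remaining columns are exactly the STRs to search for.
db = pd.read_csv(io.StringIO(csv_text), index_col="name")

str_names = list(db.columns)         # STR names, in file order
bob_counts = db.loc["Bob"].tolist()  # one person's repeat counts

print(str_names)
print(bob_counts)
```

Because the names become the index, a matching row can be reported simply by reading its index label.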

Code

I had used pandas in a class at school before, so I used it again here.

import sys
import pandas as pd

def main():
    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        sys.exit(1)

    # TODO: Read database file into a variable
    DNA_DB = pd.read_csv(sys.argv[1], index_col="name")

    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], 'r') as file:
        DNA_sequence = file.read()

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = DNA_DB.columns
    profile = [longest_match(DNA_sequence, subsequence) for subsequence in subsequences]

    # TODO: Check database for matching profiles
    match = DNA_DB.loc[DNA_DB.apply(lambda row: row.tolist() == profile, axis=1)].index
    if len(match) == 0:
        print("No match")
    else:
        print(match[0])



def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run

if __name__ == "__main__":
    main()
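
The matching step above compares rows one at a time with `apply`. As an alternative, and only as a sketch, the same check can be written as a single element-wise comparison. The helper name `find_match` and the sample data are hypothetical, and the profile is assumed to list the counts in the same order as the database columns.

```python
import pandas as pd

# Hypothetical database in the same layout the script expects:
# names as the index, one integer column per STR.
db = pd.DataFrame(
    {"AGATC": [2, 4, 3], "AATG": [8, 1, 2], "TATC": [3, 5, 5]},
    index=pd.Index(["Alice", "Bob", "Charlie"], name="name"),
)


def find_match(db, profile):
    """Return the first name whose STR counts all equal profile, or None."""
    # Compare every row against the profile column by column,
    # then keep only rows where all columns matched.
    mask = db.eq(profile).all(axis=1)
    names = db.index[mask]
    return names[0] if len(names) > 0 else None


print(find_match(db, [4, 1, 5]))
print(find_match(db, [9, 9, 9]))
```

On databases of this size, either approach is fast enough. The comparison form mainly avoids building a Python list for every row.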
